Document models, Part 1: Performance
- A look at features and performance of XML document models in Java
Dennis Sosnoski, President, Sosnoski Software Solutions, Inc.
01 Sep 2001
- In this article, Java consultant Dennis Sosnoski compares the performance and functionality of several Java document models. It's not always clear what the tradeoffs are when you choose a model, and it can require extensive recoding to switch if you later change your mind. Putting performance results in the context of feature sets and compliance with standards, the author gives some advice on making the right choice for your requirements. The article includes several charts and the source code for the test suite.
Java developers working with XML documents in memory can choose to use either a standard DOM representation or any of several Java-specific models. This flexibility has helped establish Java as a great platform for XML work. However, as the number of different models has grown, it has become difficult to determine how the models compare in terms of features, performance, and ease of use.
This first article in a series on using XML and Java technologies looks at the features and performance of some of the leading XML document models in Java. It includes the results of a set of performance tests (with the test code available for download, see Resources). The second article in the series will look at the ease-of-use issue, comparing sample code used by the different models to accomplish the same tasks.
Document models
The number of available document models in Java is growing all the time. For this article I've included the most popular models, along with a couple of choices that demonstrate especially intriguing features which may not yet be widely known or used. Given the increasing importance of XML Namespaces, I've included only models that support this feature. The models are listed below with brief introductions and version information.
Just to clarify the terminology used in this article:
- parser means the program that interprets the structure of an XML text document
- document representation means the data structures used by a program to work with the document in memory
- document model means a library and API that supports working with a document representation
Some XML applications don't need to use a document model at all. If your application can collect the information it needs from a single pass through the document, you can probably use a parser directly. This approach may require a little more work, but it will always give better performance than building a document representation in memory.
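As a rough illustration of the direct-parser approach, the sketch below uses the standard JAXP SAX API to count elements in a single pass, without building any document representation. The class name SaxElementCounter and the sample document are hypothetical and are not part of the article's test suite.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: counts elements with a given local name in one SAX pass. */
class SaxElementCounter extends DefaultHandler {
    private final String targetName;
    private int count;

    SaxElementCounter(String targetName) {
        this.targetName = targetName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Called once per start tag; nothing is retained except the running count.
        if (targetName.equals(localName)) {
            count++;
        }
    }

    static int count(String xml, String name)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        SaxElementCounter handler = new SaxElementCounter(name);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<play><act><line>a</line><line>b</line></act></play>";
        System.out.println(count(xml, "line"));
    }
}
```

The handler keeps only a counter, so memory use stays flat regardless of document size. The cost is that any information needed later has to be captured during that one pass.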
DOM
DOM (Document Object Model) is the official W3C standard for representing XML documents in a platform- and language-neutral manner. It serves as a good comparison for any Java-specific models. To make departing from the DOM standard worthwhile, Java-specific models should offer significant performance and/or ease-of-use advantages over Java DOM implementations.
The DOM definition makes very heavy use of interfaces and inheritance for the different components of an XML document. This gives developers the advantage of using a common interface for working with several different types of components, but it also adds some complexity to the API. Because the DOM is language neutral, the interfaces do not make use of common Java components such as the Collections classes.
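To make the point about language neutrality concrete, here is a minimal sketch using the DOM implementation reachable through JAXP. Child nodes come back as an index-based NodeList rather than a java.util.List, so the caller copies them into a collection by hand. The class name DomChildNames and the sample document are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical example: lists the names of the root element's child elements. */
class DomChildNames {
    static List<String> childElementNames(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // NodeList offers only getLength() and item(i); it is not a Collection.
        NodeList children = doc.getDocumentElement().getChildNodes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Every component is a Node; the type code tells them apart.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getLocalName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childElementNames("<table><atom/><atom/><note/></table>"));
    }
}
```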
This article covers two DOM implementations: Crimson and Xerces Java. Crimson is an Apache project based on the Sun Project X parser. It incorporates a full validating parser that includes support for DTDs. The parser is accessible through a SAX2 interface, and the DOM implementation can work with other SAX2 parsers. Crimson is open-source code released under the Apache license. The version used for the performance comparisons is Crimson 1.1.1 (0.2MB jar size), with the included SAX2 parser used for DOM construction from text files.
Xerces Java, the other DOM implementation tested, is another Apache project. Xerces was originally based on the IBM Java parser, commonly known as XML4J. (The redeveloped Xerces Java 2, currently in early beta, will eventually succeed it. The current version is sometimes called Xerces Java 1.) As with Crimson, the Xerces parser can be accessed through a SAX2 interface as well as through the DOM. However, Xerces does not provide any way to use a different SAX2 parser with the Xerces DOM. Xerces Java includes support for validation against both DTDs and XML Schema (with only the most minor limitations on the Schema support).
Xerces Java also supports a deferred node expansion mode for DOM (referred to in this article as Xerces deferred or Xerces def.), in which document components are initially represented with a compact format that is expanded to a full DOM representation only when used. The intent of this mode is to allow faster parsing and reduced memory usage, particularly for applications which may use only part of the input document. Like Crimson, Xerces is open-source code released under the Apache license. The version used for the performance comparisons is Xerces 1.4.2 (1.8MB jar size).
JDOM
JDOM is intended to be a Java-specific document model that makes interacting with XML simpler and faster than using DOM implementations. As the first such Java-specific model, JDOM has been heavily publicized and promoted. It is also being considered for eventual use as a Java Standard Extension through the Java Specification Request JSR-102. The actual form this will take is still under development, though, and the JDOM APIs have been undergoing significant changes between beta versions. JDOM has been under development since early 2000.
JDOM differs from DOM in two major respects. First, JDOM uses only concrete classes rather than interfaces. This simplifies the APIs in some respects, but also limits flexibility. Second, the API makes extensive use of the Collections classes, simplifying use for Java developers who are already familiar with these classes.
JDOM's documentation states that its goal is to "solve 80% (or more) of Java/XML problems with 20% (or less) of the effort" (with the 20% presumably referring to the learning curve). JDOM is certainly usable for the majority of Java/XML applications, and most developers find the API significantly easier to understand than DOM. JDOM also includes fairly extensive checks on program behavior in order to prevent the user from doing anything that does not make sense in XML. However, it still requires that you have a good understanding of XML to do anything beyond the basics (or even to understand the errors, in some cases). This is probably a more significant effort than learning either the DOM or JDOM interface.
JDOM itself does not include a parser. It normally uses a SAX2 parser for parsing and validating of an input XML document (though it can also take a previously constructed DOM representation as input). It includes converters to output a JDOM representation as a SAX2 event stream, a DOM model, or as an XML text document. JDOM is open-source code released under a variation of the Apache license. The version used for the performance comparisons is JDOM Beta 0.7 (0.1MB jar size), with the Crimson SAX2 parser used for building the JDOM representation from text files.
dom4j
dom4j originated as a kind of intellectual offshoot from JDOM, though it represents a completely separate development effort. It incorporates a number of features beyond the basic XML document representation, including integrated XPath support, XML Schema support (currently in alpha form), and event-based processing for very large or streamed documents. It also gives the option of building the document representation with parallel access through the dom4j APIs and a standard DOM interface. It has been under development since late 2000, with existing APIs preserved between recent releases.
In order to support all these features, dom4j uses an interface and abstract base class approach. dom4j makes extensive use of the Collections classes in the API, but it also provides alternatives in many cases to allow better performance or a more direct approach to coding. The net effect is that dom4j provides considerably greater flexibility than JDOM, although at the cost of a more complex API.
dom4j shares the JDOM goals of ease of use and intuitive operation for Java developers while adding the goals of flexibility, XPath integration, and very large document handling. It also aims to be a more complete solution than JDOM, with the goal of handling essentially all Java/XML problems. In doing this it has less emphasis on preventing incorrect application program behavior than JDOM.
dom4j uses the same approach to input and output as JDOM, relying on a SAX2 parser for the input handling and converters to handle output as a SAX2 event stream, a DOM model, or an XML text document. dom4j is open-source code released under a BSD-style license, which is essentially equivalent to the Apache-style licenses. The version used for the performance comparisons is dom4j 0.9 (0.4MB jar size), with the bundled AElfred SAX2 parser used for building the representation from text files (due to SAX2 option settings, one of the test files could not be handled by dom4j using the same Crimson SAX2 parser as used for the JDOM test).
Electric XML
Electric XML (EXML) is a spin-off from a commercial project supporting distributed computing. It differs from the other models discussed so far in that it properly supports only a subset of XML documents, it does not provide any support for validation, and it has a more restrictive license. However, EXML offers the advantages of very small size and direct support for an XPath subset, and it made an interesting candidate for this comparison since it has been promoted as an alternative to the other models in several recent articles.
EXML takes a similar approach to JDOM in avoiding the use of interfaces, though it achieves somewhat the same effect by using abstract base classes (the difference being mainly that interfaces provide greater flexibility for extending the implementation). It differs from JDOM in also avoiding the use of the Collections classes. This combination gives it a fairly simple API that in many respects resembles a flattened version of the DOM API, with the addition of XPath operations.
EXML preserves whitespace within a document only when the whitespace is adjacent to non-whitespace text content, a limitation that restricts EXML to a subset of XML documents. Standard XML requires that whitespace be preserved when a document is read unless the whitespace can be determined to be insignificant from the document DTD. The EXML approach works fine for many XML applications where whitespace is known to be insignificant in advance, but it prevents using EXML for documents that expect whitespace to be preserved (such as applications generating documents to be displayed by browsers or otherwise viewed). (See the sidebar Whitespace wishes for the author's modest proposal on this topic.)
This whitespace deletion can have a misleading effect on performance comparisons -- many types of tests scale proportional to the number of components in the document, and each whitespace sequence deleted by EXML is a component in other models. EXML is included in the results shown in this article, but keep this effect in mind when interpreting the performance differences.
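The effect on component counts is easy to see with a model that keeps whitespace. The sketch below uses the JAXP DOM parser (not EXML) to count the whitespace-only text nodes in a small indented document; these are the components a whitespace-discarding model would never create. The class name WhitespaceText and the sample document are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical example: counts text nodes that contain only whitespace. */
class WhitespaceText {
    static int countWhitespaceOnly(Node node) {
        int count = 0;
        if (node.getNodeType() == Node.TEXT_NODE && node.getNodeValue().trim().isEmpty()) {
            count++;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countWhitespaceOnly(child);
        }
        return count;
    }

    static int countWhitespaceOnly(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        // No DTD is supplied, so the parser has no basis for discarding whitespace.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return countWhitespaceOnly(
                factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml))));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a>\n  <b>text</b>\n  <c/>\n</a>";
        System.out.println(countWhitespaceOnly(xml));
    }
}
```

In the sample, the indentation before b, between b and c, and after c each becomes a separate text node, so a document with only three elements carries three extra components.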
EXML uses an integrated parser for building the document representation from a text document. It does not provide any means of converting to or from DOM or SAX2 event streams except by way of text. EXML is open-source code released by The Mind Electric under a restricted license that prohibits embedding it in certain types of applications or libraries. The version used for the performance comparisons is Electric XML 2.2 (0.05MB jar size).
XML Pull Parser
XML Pull Parser (XPP) is a recent development that demonstrates a different approach to XML parsing. As with EXML, XPP properly supports only a subset of XML documents and does not provide any support for validation. It does share the advantage of a very small size. That advantage, combined with its pull-parser approach, made it a good alternative to include in this comparison.
XPP uses interfaces almost exclusively, but it uses only a small number of classes in total. As with EXML, XPP avoids the use of the Collections classes in the API. Overall, it's the simplest document model API I included in this article.
The limitation that restricts XPP to a subset of XML documents is that it does not support entities, comments, or processing instructions in the document. XPP creates a document structure consisting only of elements, attributes (including Namespaces), and content text. This is a serious limitation for some types of applications. However, it does not generally have the same effect on performance comparisons as the EXML whitespace handling. Only one of the test files I used for this article was incompatible with XPP, and XPP results are shown in the charts with a note that this file was not included.
The pull-parser support in XPP (referred to in this article as XPP pull) works by actually postponing parsing until a component of the document is accessed, then parsing as much of the document as necessary to construct that component. This technique is intended to allow for very fast document-screening or classification applications, especially in cases where documents may need to be forwarded or otherwise disposed of rather than completely parsed and processed. Use of this approach is optional, and if you use XPP in non-pull mode it parses the entire document and builds the complete representation concurrently.
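The document-screening idea can be sketched with StAX (javax.xml.stream), a standard pull API that postdates this article. Note the substitution: this is not the XPP API, and StAX delivers events rather than lazily building a document representation as XPP pull does. What it shares with XPP pull is that the application decides how far to parse. The class name PullScreening is hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical example: classifies a document by its root element name. */
class PullScreening {
    /** Returns the root element's local name, pulling events only up to its start tag. */
    static String rootElementName(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // The application asks for each event; the parser never runs ahead on its own.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<soap:Envelope xmlns:soap='http://schemas.xmlsoap.org/soap/envelope/'>"
                + "<soap:Body/></soap:Envelope>";
        System.out.println(rootElementName(xml));
    }
}
```

A forwarding application could route a message on this name alone and hand the unparsed text to its destination, which is the kind of use the pull approach is meant to serve.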
As with EXML, XPP uses an integrated parser for building the document representation from a text document, and it does not provide any means of converting to or from DOM or SAX2 event streams except by way of text. XPP is open-source code with an Apache-style license. The version used for the performance comparisons is PullParser 2.0.1 Beta 8 (0.04MB jar size).
Performance comparisons
The performance comparisons used in this article are based on parsing and working with a set of selected XML documents intended to be representative of a wide range of applications:
- much_ado.xml, the Shakespeare play marked up as XML. No attributes and a fairly flat structure (202K bytes).
- periodic.xml, periodic table of the elements in XML. Some attributes, also fairly flat (117K bytes).
- soap1.xml, sample SOAP document taken from the specification. Heavy namespaces and attributes (0.4K bytes, repeated 49 times each test pass).
- soap2.xml, list of values in SOAP document form. Heavy on namespaces and attributes (134K bytes).
- nt.xml, the New Testament marked up as XML. No attributes and very flat structure, heavy text content (1047K bytes).
- xml.xml, the XML specification, with the DTD reference removed and all entities defined inline. Text-style markup with heavy mixed content, some attributes (160K bytes).
For more information about the test platform, see the Test details below and see Resources for a link to the source code used for the testing.
With the exception of the very small soap1.xml document, all measured times are per pass of a particular test on the document. In the case of soap1.xml, the measured time is for 49 consecutive passes of each test on the document (enough copies to total 20K bytes of text).
The test framework uses the approach of running one particular test some number of times (10 for the results shown) on a document, tracking the best and average times for that test, then moving to the next test on the same document. After the full sequence of tests has been completed on one document, it repeats the process with the next document. In order to prevent interactions between the document models, only one model is tested in each execution of the test framework.
Timing benchmarks using HotSpot and similar dynamically optimizing JVMs are notoriously tricky; small changes in the test sequence often cause large variations in timing results. I've found that this is especially true for average times executing a particular piece of code; the best times are much more consistent, and are the values I've shown in these results. You can see a comparison of average versus best times for the first test, Document build time.
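The repeat-and-keep-the-best approach described above can be sketched as follows. This is not the article's test framework (see Resources for that); the class name BestOfTimer and the use of System.nanoTime are assumptions made for illustration.

```java
/** Hypothetical example: runs a test several times, tracking best and average time. */
class BestOfTimer {
    /** Best and average elapsed times for one test, in nanoseconds. */
    static final class Result {
        final long bestNanos;
        final double averageNanos;

        Result(long bestNanos, double averageNanos) {
            this.bestNanos = bestNanos;
            this.averageNanos = averageNanos;
        }
    }

    static Result time(Runnable test, int passes) {
        if (passes <= 0) {
            throw new IllegalArgumentException("passes must be positive");
        }
        long best = Long.MAX_VALUE;
        long total = 0;
        for (int i = 0; i < passes; i++) {
            long start = System.nanoTime();
            test.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed); // least disturbed by JIT warm-up and GC pauses
            total += elapsed;
        }
        return new Result(best, (double) total / passes);
    }

    public static void main(String[] args) {
        // Stand-in workload; a real run would parse or walk a document here.
        Result r = time(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        }, 10);
        System.out.println("best " + r.bestNanos + " ns, average " + r.averageNanos + " ns");
    }
}
```

Early passes typically include compilation and warm-up costs, which inflate the average but not the minimum. That is one plausible reason the best times are the more stable figure.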
Document build time
The document build time test looks at the time required to parse a text document and construct the document representation. I've included the SAX2 parse time using Crimson and Xerces SAX2 parsers in the chart for comparison purposes, since most of the document models (all except EXML and XPP) use a SAX2 parse-event stream as input to the document building process. Figure 1 depicts the test results.
XPP pull's build time is too fast to measure for most of the test documents (since the documents are not actually parsed in this case), only showing up for the very short soap1.xml. The pull-parser memory size and associated creation overhead make XPP appear relatively slow for this file. This is because the test program creates a new copy of the pull parser for each copy of the document being parsed. In the case of soap1.xml, 49 copies are used for each measured time. The overhead of allocating and initializing these parser instances is greater than the time required by most of the other approaches to repeatedly parse the text and build the document representations.
The author of XPP pointed out in an e-mail discussion that in a real application the pull parser instances could be pooled for reuse. If this were done, the apparent overhead for the soap1.xml file would drop to an insignificant level. Even without pooling, the pull-parser creation overhead is insignificant for larger files.
XPP (with full parse), Xerces with deferred node creation, and dom4j all show about equal performance overall in this test. Xerces deferred does especially well on the larger documents, but appears to have a high overhead for smaller documents -- much higher even than the normal Xerces DOM. The deferred-node creation approach also has a relatively high overhead on first use of portions of the document, which reduces the advantage of the fast parse.
Xerces in all forms (SAX2 parse, normal DOM, and deferred DOM) appears to have a high overhead for the small soap1.xml file. XPP (full parse) does especially well with this file, and even EXML beats the SAX2-based models on soap1.xml. Overall, though, EXML is the poorest performer in this test despite the advantage it receives from discarding isolated whitespace content.
Document walk time
The document walk time test looks at the time required to walk the constructed document representation, going through each element, attribute, and text content segment in document order. It's intended to represent the performance of the document model interfaces, which can be important for applications that repeatedly access information from a document once it's been parsed. Overall, the walk times are much faster than the parse times. For applications that make only a single pass through each parsed document, the parse times are going to be much more significant than the walk times. Figure 2 charts the results.
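For readers unfamiliar with what such a walk involves, here is a minimal sketch against the standard DOM interfaces. It visits every element, attribute, and text segment in document order and tallies them. It is not the article's test code, and the class name DomWalker is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical example: walks a DOM tree in document order, counting components. */
class DomWalker {
    /** counts[0] = elements, counts[1] = attributes, counts[2] = text segments. */
    static void walk(Node node, int[] counts) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                counts[0]++;
                counts[1] += node.getAttributes().getLength();
                break;
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                counts[2]++;
                break;
            default:
                break; // document node, comments, processing instructions
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, counts);
        }
    }

    static int[] countComponents(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        int[] counts = new int[3];
        walk(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))), counts);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        int[] c = countComponents("<a x='1' y='2'><b>hi</b><c/>tail</a>");
        System.out.println(c[0] + " elements, " + c[1] + " attributes, " + c[2] + " text segments");
    }
}
```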
XPP is the best performer in this test by a considerable margin. Xerces DOM takes about twice as long as XPP. EXML takes almost three times as long as XPP, despite the advantage it receives from discarding isolated whitespace content in the documents. dom4j comes in near the middle of the range.
When XPP pull is used, parsing of the document text does not actually take place until the document representation is accessed. This results in a very large overhead the first time a document representation is walked (not shown in the chart). XPP shows a net loss of performance when the pull-parser approach is used if the entire document representation is later accessed. The total time required for these first two tests is greater for the pull parser than for using XPP with normal parsing (anywhere between 20 and 100 percent greater, depending on the document). However, the pull-parser approach can still give a substantial performance benefit when most of the documents being parsed are not accessed in full.
Xerces with deferred node creation shows similar behavior in that there's a performance penalty as the nodes in a document representation are accessed for the first time (not shown in the chart). However, in the case of Xerces the node-creation overhead is about the same as the difference in performance from normal DOM creation during the parse. For the larger documents, the total time required for the first two tests is about the same using Xerces deferred as it is using Xerces with normal DOM construction. If you're using Xerces on fairly large documents (perhaps 10KB or larger) the deferred node-creation option looks like a good choice.
Document modify time
This test, the results of which appear in Figure 3, looks at the time required to systematically modify the constructed document representation. It walks the representation, deleting all isolated whitespace content and wrapping each non-whitespace content string with a new, added, element. It also adds an attribute to each element of the original document that contained non-whitespace content. This test is intended to represent the performance of the document models across a range of modifications to the documents. As with the walk times, the modify times are considerably faster than the parse times. As a result, the parse times are going to be more important for applications that make only a single pass through each parsed document.
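The three modifications can be sketched against the standard DOM interfaces as shown below. This is an illustration of the steps described, not the article's test code; the class name DomModifier, the wrapper element name "wrapped", and the attribute name "hasText" are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical example: strips isolated whitespace and wraps remaining text. */
class DomModifier {
    static void modify(Element element) {
        Document doc = element.getOwnerDocument();
        boolean hadText = false;
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // capture before the tree changes
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getNodeValue().trim().isEmpty()) {
                    element.removeChild(child);           // delete isolated whitespace
                } else {
                    Element wrapper = doc.createElement("wrapped");
                    element.replaceChild(wrapper, child); // wrapper takes the text's place
                    wrapper.appendChild(child);           // text moves inside the wrapper
                    hadText = true;
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                modify((Element) child);
            }
            child = next;
        }
        if (hadText) {
            element.setAttribute("hasText", "true");
        }
    }

    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a>\n <b>hi</b>\n <c/>\n</a>");
        modify(doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Saving the next sibling before each change keeps the loop from visiting the newly added wrapper elements, so only original content is processed.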
Figure 3. Document modify time
EXML comes out ahead in this test, but it again has a performance advantage over the other models in that it always discards isolated whitespace content during the parse. This means that there are no deletions from the EXML representations during this test.
XPP takes a close second place in the modification performance, and, unlike EXML, the XPP test includes deletions. Xerces DOM and dom4j come in near the middle of the range, with JDOM and Crimson DOM models again giving the poorest performance.
Text generation time
This test looks at the time required to output document representations as text XML documents; results appear in Figure 4. This step is likely to be an important part of the overall performance for any applications that are not pure consumers of XML documents, especially since the time required to output the document as text is generally close to the parse time on document input. To make these times directly comparable, this test uses the original documents, not the modified versions generated by the previous test.
Figure 4. Text generation time
The text generation time test shows a narrower performance range than the previous tests, with Xerces DOM the best performer by a relatively small margin and JDOM the worst. EXML performance measures better than JDOM, but this is again biased by EXML discarding whitespace content.
Many of the models provide options to control text output formats, and some of the options are likely to affect the text generation time. This test uses only the most basic form of output for each model, so the results show only the default performance rather than the best possible performance.
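As an example of the kind of output option involved, the sketch below writes a DOM tree as text using the JAXP identity transformer, with indentation as a switchable option. It uses the standard javax.xml.transform API rather than any of the model-specific writers tested here, and the class name DomTextWriter is hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical example: serializes a DOM node as XML text. */
class DomTextWriter {
    static String toXml(Node node, boolean indent) throws TransformerException {
        // A transformer created without a stylesheet copies its input unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("greeting");
        root.setAttribute("lang", "en");
        root.appendChild(doc.createTextNode("hello"));
        doc.appendChild(root);
        System.out.println(toXml(doc, false));
    }
}
```

Options such as indentation add work per node, so any timing comparison needs to state which settings were in effect.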
Document memory size
This test looks at the memory space used for document representations. This can be a significant concern for developers who work with large documents, or multiple smaller documents concurrently. Figure 5 shows the results of this test.
Figure 5. Document memory size
The memory-size results differ from the timing tests in that the values shown for the small soap1.xml file represent a single copy of the file, rather than the 49 copies used in the timing measurements. In most of the models, the memory used for this brief document is too small to show on the scale of the chart.
With the exception of XPP pull (which does not actually build the document representation until it is accessed), the differences between the models are relatively small compared to the differences shown in some of the timing tests. Xerces deferred has the most compact representation (which expands to the base Xerces size as the representation is accessed for the first time), followed closely by dom4j. EXML has the least compact representation, despite discarding whitespace content included in the other models.
All the models are likely to require too much memory for very large documents since even the most compact take up approximately four times the raw document text size in bytes. XPP pull and dom4j offer the best support for very large documents by providing means of working with partial document representations. XPP pull does this by only building the portions of the representation which are actually accessed, while dom4j includes support for event-based handling to construct or process only a portion of the document at a time.
Java serialization
These tests measure the time and output size for Java serialization of the document representations. This is mainly a concern for applications that transfer a representation between Java programs using Java RMI (Remote Method Invocation), including EJB (Enterprise JavaBean) applications. I included only those models that support Java serialization in these tests. The three figures below show the results of this test.
Figure 6. Serialization output time
Figure 7. Serialization input time
Figure 8. Serialized document size
dom4j shows the best serialization performance for both output (generating the serialized form) and input (reconstructing the document from the serialized form), while Xerces DOM shows the poorest. EXML comes close to dom4j's time, but again EXML has an advantage in working with a considerably smaller number of objects in the representation because of the discarded whitespace content.
All performance -- both in time and size -- would be much better if the documents were output as text and parsed back in to reconstruct the documents, instead of using Java serialization. The structure of the XML document representations as large numbers of unique small objects is the problem here. Java serialization does not handle this type of structure efficiently, resulting in high overhead in both time and output size.
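The text-based alternative amounts to a round trip like the one sketched below, shown here with the standard JAXP DOM and transformer APIs. The class name TextRoundTrip and the sample document are hypothetical; the byte array stands in for whatever is sent between programs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Hypothetical example: transfers a document as XML text instead of serialized objects. */
class TextRoundTrip {
    /** Sender side: write the document out as XML text bytes. */
    static byte[] toText(Document doc) throws TransformerException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    /** Receiver side: parse the text to rebuild the document representation. */
    static Document fromText(byte[] text)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(text));
    }

    public static void main(String[] args) throws Exception {
        Document original = fromText(
                "<order id=\"7\"><item>tea</item></order>".getBytes(StandardCharsets.UTF_8));
        byte[] wire = toText(original);
        Document copy = fromText(wire);
        System.out.println(wire.length + " bytes, equal content: "
                + original.getDocumentElement().isEqualNode(copy.getDocumentElement()));
    }
}
```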
You can design serialized forms for documents that are smaller than the text representations and faster than text to input and output, but only by bypassing Java serialization. (I have a project implementing such a customized serialization for XML documents available as open-source code on my company's Web site, see Resources.)
Conclusions
The different Java XML document models all have some areas of strength, but from the performance standpoint there are some clear winners.
XPP is the performance leader in most respects. For middleware-type applications that do not require validation, entities, processing instructions, or comments, XPP looks to be an excellent choice despite its newness. This is especially true for applications running as browser applets or in limited memory environments.
dom4j doesn't have the sheer speed of XPP, but it does provide very good performance with a much more standardized and fully functional implementation, including built-in support for SAX2, DOM, and even XPath. Xerces DOM (with deferred node creation) also does well on most performance measurements, though it suffers on small files and Java serialization. For general XML handling, both dom4j and Xerces DOM are probably good choices, with the preference between the two determined by whether you consider Java-specific features or cross-language compatibility more important.
JDOM and Crimson DOM consistently rank poorly on the performance tests. Crimson DOM may still be worth using in the case of small documents, where Xerces does poorly. JDOM doesn't really have anything to recommend it from the performance standpoint, though the developers have said they intend to focus on performance before the official release. However, it'll probably be difficult for JDOM to match the performance of the other models without some restructuring of the API.
EXML is very small (in jar file size) and does well in some of the performance tests. Even with the advantage of deleting isolated whitespace content, though, EXML does not match XPP performance. Unless you need one of the features EXML supports but that XPP lacks, XPP is probably a better choice in limited-memory environments.
Currently none of the models can offer good performance for Java serialization, though dom4j does the best. If you need to transfer a document representation between programs, generally your best alternative is to write the document out as text and parse it back in to reconstruct the representation. Custom serialization formats may offer a better alternative in the future.
Coming next ...
I've covered the basic features of the document models and presented performance measurements for several types of operations on documents. Keep in mind, though, that performance is only one issue in the choice of document model. Usability is at least as important for most developers, and these models use different APIs that may offer reasons to prefer one over the others.
Usability will be the focus in the follow-up article, where I'll compare sample code used to accomplish the same operations in these different models. Check back for the second part of this comparison. While you're waiting, you can share your comments and questions on this article in the discussion forum, linked below.
Resources
- Participate in the discussion forum.
- If you need background, try the developerWorks XML programming in Java, the Understanding SAX tutorial, and the Understanding DOM tutorial.
- Download the test program and document model libraries used for this article on the download page.
- Check for updated tests and test results at the home page for the test program.
- Research or download the Java XML document models discussed in this article:
- IBM WebSphere Application Server includes the XML4J parser, which is based on Xerces Java. Find detailed how-to information about the product's XML support in the WAS Advanced edition 3.0 online documentation.
About the author
Dennis Sosnoski is the founder and lead consultant of Seattle-area Java consulting company Sosnoski Software Solutions, Inc. His professional software development experience spans over 30 years, with the last several years focused on server-side Java technologies including servlets, Enterprise JavaBeans, and XML. He's given a number of presentations on both Java performance issues and general server-side Java technologies, and chairs the Seattle Java-XML SIG.