Document models, Part 1: Performance
From NeoWiki
Revision as of 15:01, 9 April 2007
- A look at features and performance of XML document models in Java
Dennis Sosnoski, President, Sosnoski Software Solutions, Inc.
01 Sep 2001
- In this article, Java consultant Dennis Sosnoski compares the performance and functionality of several Java document models. It's not always clear what the tradeoffs are when you choose a model, and it can require extensive recoding to switch if you later change your mind. Putting performance results in the context of feature sets and compliance with standards, the author gives some advice on making the right choice for your requirements. The article includes several charts and the source code for the test suite.
Java developers working with XML documents in memory can choose to use either a standard DOM representation or any of several Java-specific models. This flexibility has helped establish Java as a great platform for XML work. However, as the number of different models has grown, it has become difficult to determine how the models compare in terms of features, performance, and ease of use.
This first article in a series on using XML and Java technologies looks at the features and performance of some of the leading XML document models in Java. It includes the results of a set of performance tests (with the test code available for download, see Resources). The second article in the series will look at the ease-of-use issue, comparing sample code used by the different models to accomplish the same tasks.
Document models
The number of available document models in Java is growing all the time. For this article I've included the most popular models, along with a couple of choices that demonstrate especially intriguing features which may not yet be widely known or used. Given the increasing importance of XML Namespaces, I've included only models that support this feature. The models are listed below with brief introductions and version information.
Just to clarify the terminology used in this article:
- parser means the program that interprets the structure of an XML text document
- document representation means the data structures used by a program to work with the document in memory
- document model means a library and API that supports working with a document representation
Some XML applications don't need to use a document model at all. If your application can collect the information it needs from a single pass through the document, you can probably use a parser directly. This approach may require a little more work, but it will always give better performance than building a document representation in memory.
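As a rough illustration of using a parser directly, the sketch below counts elements by name in a single SAX pass without building any document representation. It is not taken from the article's test suite: the class name `SinglePassCount`, the method `countElements`, and the sample input are hypothetical, and it assumes the JAXP SAX parser bundled with a current JDK.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of the article's test code.
class SinglePassCount {

    /** Counts elements by local name in one pass; nothing is kept in memory except the counts. */
    static Map<String, Integer> countElements(String xml) {
        final Map<String, Integer> counts = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Called once per start tag as the parser streams through the text.
                counts.merge(localName, 1, Integer::sum);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that localName is populated
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input in the style of a marked-up play.
        String xml = "<PLAY><ACT><SCENE/><SCENE/></ACT><ACT><SCENE/></ACT></PLAY>";
        System.out.println(countElements(xml));
    }
}
```

The extra work mentioned above shows up in the handler: any state the application needs has to be tracked by hand as events arrive, because there is no tree to navigate afterward.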
DOM
DOM (Document Object Model) is the official W3C standard for representing XML documents in a platform- and language-neutral manner. It serves as a good comparison for any Java-specific models. To make departing from the DOM standard worthwhile, Java-specific models should offer significant performance and/or ease-of-use advantages over Java DOM implementations.
The DOM definition makes very heavy use of interfaces and inheritance for the different components of an XML document. This gives developers the advantage of using a common interface for working with several different types of components, but it also adds some complexity to the API. Because the DOM is language neutral, the interfaces do not make use of common Java components such as the Collections classes.
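To make the last point concrete, here is a minimal sketch of walking an element's children through the standard DOM interfaces. It assumes the JAXP DOM parser bundled with a current JDK rather than the Crimson or Xerces versions tested here, and the class name `DomTraversal`, the method `childElementNames`, and the sample input are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of the article's test code.
class DomTraversal {

    /** Returns the local names of the element children of the document's root element. */
    static List<String> childElementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            // NodeList is a DOM interface, not a java.util collection:
            // it is accessed by index with getLength() and item().
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Every component type shares the Node interface, so the type must be checked.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(child.getLocalName());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sample input mixing elements, a comment, and text.
        System.out.println(childElementNames("<table><atom/><!-- note --><atom/>text</table>"));
    }
}
```

The common `Node` interface lets one loop handle elements, comments, and text alike, at the cost of explicit type checks and index-based access.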
This article covers two DOM implementations: Crimson and Xerces Java. Crimson is an Apache project based on the Sun Project X parser. It incorporates a full validating parser that includes support for DTDs. The parser is accessible through a SAX2 interface, and the DOM implementation can work with other SAX2 parsers. Crimson is open-source code released under the Apache license. The version used for the performance comparisons is Crimson 1.1.1 (0.2MB jar size), with the included SAX2 parser used for DOM construction from text files.
Xerces Java, the other DOM implementation tested, is another Apache project. Xerces was originally based on the IBM Java parser, commonly known as XML4J. (The redeveloped Xerces Java 2, currently in early beta, will eventually succeed it. The current version is sometimes called Xerces Java 1.) As with Crimson, the Xerces parser can be accessed through a SAX2 interface as well as through the DOM. However, Xerces does not provide any way to use a different SAX2 parser with the Xerces DOM. Xerces Java includes support for validation against both DTDs and XML Schema (with only the most minor limitations on the Schema support).
Xerces Java also supports a deferred node expansion mode for DOM (referred to in this article as Xerces deferred or Xerces def.), in which document components are initially represented with a compact format that is expanded to a full DOM representation only when used. The intent of this mode is to allow faster parsing and reduced memory usage, particularly for applications which may use only part of the input document. Like Crimson, Xerces is open-source code released under the Apache license. The version used for the performance comparisons is Xerces 1.4.2 (1.8MB jar size).
JDOM
JDOM is intended to be a Java-specific document model that makes interacting with XML simpler and faster than using DOM implementations. As the first such Java-specific model, JDOM has been heavily publicized and promoted. It is also being considered for eventual use as a Java Standard Extension through the Java Specification Request JSR-102. The actual form this will take is still under development, though, and the JDOM APIs have been undergoing significant changes between beta versions. JDOM has been under development since early 2000.
JDOM differs from DOM in two major respects. First, JDOM uses only concrete classes rather than interfaces. This simplifies the APIs in some respects, but also limits flexibility. Second, the API makes extensive use of the Collections classes, simplifying use for Java developers who are already familiar with these classes.
JDOM's documentation states that its goal is to "solve 80% (or more) of Java/XML problems with 20% (or less) of the effort" (with the 20% presumably referring to the learning curve). JDOM is certainly usable for the majority of Java/XML applications, and most developers find the API significantly easier to understand than DOM. JDOM also includes fairly extensive checks on program behavior in order to prevent the user from doing anything that does not make sense in XML. However, it still requires that you have a good understanding of XML to do anything beyond the basics (or even to understand the errors, in some cases). Gaining that understanding is probably a more significant effort than learning either the DOM or JDOM interface.
JDOM itself does not include a parser. It normally uses a SAX2 parser for parsing and validating of an input XML document (though it can also take a previously constructed DOM representation as input). It includes converters to output a JDOM representation as a SAX2 event stream, a DOM model, or as an XML text document. JDOM is open-source code released under a variation of the Apache license. The version used for the performance comparisons is JDOM Beta 0.7 (0.1MB jar size), with the Crimson SAX2 parser used for building the JDOM representation from text files.
dom4j
dom4j originated as a kind of intellectual offshoot from JDOM, though it represents a completely separate development effort. It incorporates a number of features beyond the basic XML document representation, including integrated XPath support, XML Schema support (currently in alpha form), and event-based processing for very large or streamed documents. It also gives the option of building the document representation with parallel access through the dom4j APIs and a standard DOM interface. It has been under development since late 2000, with existing APIs preserved between recent releases.
In order to support all these features, dom4j uses an interface and abstract base class approach. dom4j makes extensive use of the Collections classes in the API, but it also provides alternatives in many cases to allow better performance or a more direct approach to coding. The net effect is that dom4j provides considerably greater flexibility than JDOM, although at the cost of a more complex API.
dom4j shares the JDOM goals of ease of use and intuitive operation for Java developers while adding the goals of flexibility, XPath integration, and very large document handling. It also aims to be a more complete solution than JDOM, with the goal of handling essentially all Java/XML problems. In doing this it has less emphasis on preventing incorrect application program behavior than JDOM.
dom4j uses the same approach to input and output as JDOM, relying on a SAX2 parser for the input handling and converters to handle output as a SAX2 event stream, a DOM model, or an XML text document. dom4j is open-source code released under a BSD-style license, which is essentially equivalent to the Apache-style licenses. The version used for the performance comparisons is dom4j 0.9 (0.4MB jar size), with the bundled AElfred SAX2 parser used for building the representation from text files (due to SAX2 option settings, one of the test files could not be handled by dom4j using the same Crimson SAX2 parser as used for the JDOM test).
Electric XML
Electric XML (EXML) is a spin-off from a commercial project supporting distributed computing. It differs from the other models discussed so far in that it properly supports only a subset of XML documents, it does not provide any support for validation, and it has a more restrictive license. However, EXML offers the advantages of very small size and direct support for an XPath subset, and it made an interesting candidate for this comparison since it has been promoted as an alternative to the other models in several recent articles.
EXML takes a similar approach to JDOM in avoiding the use of interfaces, though it achieves somewhat the same effect by using abstract base classes (the difference being mainly that interfaces provide greater flexibility for extending the implementation). It differs from JDOM in also avoiding the use of the Collections classes. This combination gives it a fairly simple API that in many respects resembles a flattened version of the DOM API, with the addition of XPath operations.
EXML preserves whitespace within a document only when the whitespace is adjacent to non-whitespace text content, a limitation that restricts EXML to a subset of XML documents. Standard XML requires that whitespace be preserved when a document is read unless the whitespace can be determined to be insignificant from the document DTD. The EXML approach works fine for many XML applications where whitespace is known to be insignificant in advance, but it prevents using EXML for documents that expect whitespace to be preserved (such as applications generating documents to be displayed by browsers or otherwise viewed). (See the sidebar Whitespace wishes for the author's modest proposal on this topic.)
This whitespace deletion can have a misleading effect on performance comparisons: many types of tests scale in proportion to the number of components in the document, and each whitespace sequence deleted by EXML is a component in other models. EXML is included in the results shown in this article, but keep this effect in mind when interpreting the performance differences.
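The effect on component counts can be seen with a small experiment. The sketch below uses the JDK's bundled DOM parser, not EXML. The class name `WhitespaceNodes`, the method `countChildren`, and the sample document are hypothetical, and skipping whitespace-only text nodes is only an approximation of the EXML behavior described above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of the article's test code.
class WhitespaceNodes {

    /**
     * Counts the child nodes of the root element. When skipWhitespaceOnly is true,
     * text nodes containing nothing but whitespace are left out of the count,
     * roughly approximating a model that discards such whitespace.
     */
    static int countChildren(String xml, boolean skipWhitespaceOnly) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                boolean blankText = child.getNodeType() == Node.TEXT_NODE
                        && child.getNodeValue().trim().isEmpty();
                if (!(skipWhitespaceOnly && blankText)) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up, pretty-printed sample: indentation becomes text nodes between elements.
        String xml = "<a>\n  <b>x</b>\n  <c>y</c>\n</a>";
        System.out.println("whitespace preserved: " + countChildren(xml, false));
        System.out.println("whitespace dropped:   " + countChildren(xml, true));
    }
}
```

For this indented sample, the whitespace-preserving count should be five child nodes against two when whitespace-only text is skipped, so a test that walks every component does well under half the work in the second case.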
EXML uses an integrated parser for building the document representation from a text document. It does not provide any means of converting to or from DOM or SAX2 event streams except by way of text. EXML is open-source code released by The Mind Electric under a restricted license that prohibits embedding it in certain types of applications or libraries. The version used for the performance comparisons is Electric XML 2.2 (0.05MB jar size).
XML Pull Parser
XML Pull Parser (XPP) is a recent development that demonstrates a different approach to XML parsing. As with EXML, XPP properly supports only a subset of XML documents and does not provide any support for validation. It also shares EXML's advantage of very small size. That advantage, combined with its pull-parser approach, made it a good alternative to include in this comparison.
XPP uses interfaces almost exclusively, but it uses only a small number of classes in total. As with EXML, XPP avoids the use of the Collections classes in the API. Overall, it's the simplest document model API I included in this article.
The limitation that restricts XPP to a subset of XML documents is that it does not support entities, comments, or processing instructions in the document. XPP creates a document structure consisting only of elements, attributes (including Namespaces), and content text. This is a serious limitation for some types of applications. However, it does not generally have the same effect on performance comparisons as the EXML whitespace handling. Only one of the test files I used for this article was incompatible with XPP, and XPP results are shown in the charts with a note that this file was not included.
The pull-parser support in XPP (referred to in this article as XPP pull) works by actually postponing parsing until a component of the document is accessed, then parsing as much of the document as necessary to construct that component. This technique is intended to allow for very fast document-screening or classification applications, especially in cases where documents may need to be forwarded or otherwise disposed of rather than completely parsed and processed. Use of this approach is optional, and if you use XPP in non-pull mode it parses the entire document and builds the complete representation concurrently.
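The document-screening idea can be sketched with a different pull API. The example below uses StAX (`javax.xml.stream`), which ships with current JDKs, as a stand-in. It is not the XPP API, and unlike XPP pull it does not build a document representation. The class name `PullScreen`, the method `rootElementName`, and the sample input are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; StAX is used here as a stand-in for XPP.
class PullScreen {

    /**
     * Returns the local name of the root element, pulling events only until
     * the first start tag is seen. Returns null if no element is found.
     */
    static String rootElementName(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The application asks for the next event; nothing is pushed to it.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up SOAP-style sample input.
        String xml = "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
                + "<soap:Body/></soap:Envelope>";
        System.out.println(rootElementName(xml));
    }
}
```

A screening application along these lines could classify or forward a document based on its first few components and stop pulling events there, which is the kind of use the pull approach is intended for.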
As with EXML, XPP uses an integrated parser for building the document representation from a text document, and it does not provide any means of converting to or from DOM or SAX2 event streams except by way of text. XPP is open-source code with an Apache-style license. The version used for the performance comparisons is PullParser 2.0.1 Beta 8 (0.04MB jar size).
Performance comparisons
The performance comparisons used in this article are based on parsing and working with a set of selected XML documents intended to be representative of a wide range of applications:
- much_ado.xml, the Shakespeare play marked up as XML. No attributes and a fairly flat structure (202K bytes).
- periodic.xml, periodic table of the elements in XML. Some attributes, also fairly flat (117K bytes).
- soap1.xml, sample SOAP document taken from the specification. Heavy namespaces and attributes (0.4K bytes, repeated 49 times each test pass).
- soap2.xml, list of values in SOAP document form. Heavy on namespaces and attributes (134K bytes).
- nt.xml, the New Testament marked up as XML. No attributes and very flat structure, heavy text content (1047K bytes).
- xml.xml, the XML specification, with the DTD reference removed and all entities defined inline. Text-style markup with heavy mixed content, some attributes (160K bytes).
For more information about the test platform, see the Test details below and see Resources for a link to the source code used for the testing.
With the exception of the very small soap1.xml document, all measured times are per pass of a particular test on the document. In the case of soap1.xml, the measured time is for 49 consecutive passes of each test on the document (enough copies to total 20K bytes of text).
The test framework uses the approach of running one particular test some number of times (10 for the results shown) on a document, tracking the best and average times for that test, then moving to the next test on the same document. After the full sequence of tests has been completed on one document, it repeats the process with the next document. In order to prevent interactions between the document models, only one model is tested in each execution of the test framework.
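The core of this approach, repeating one test and tracking the best and average times, can be sketched as follows. This is not the article's test framework (see Resources for that): the class names `BestTimeHarness` and `Timing` and the method `time` are hypothetical, and `System.nanoTime()` is an assumption about how elapsed time might be measured on a current JDK.

```java
// Hypothetical example class; not the article's actual test framework.
class BestTimeHarness {

    /** Best and average elapsed time, in nanoseconds, over a series of passes. */
    static final class Timing {
        final long bestNanos;
        final long averageNanos;

        Timing(long bestNanos, long averageNanos) {
            this.bestNanos = bestNanos;
            this.averageNanos = averageNanos;
        }
    }

    /** Runs one test the given number of times, tracking the best and average times. */
    static Timing time(Runnable test, int passes) {
        if (passes <= 0) {
            throw new IllegalArgumentException("passes must be positive");
        }
        long best = Long.MAX_VALUE;
        long total = 0;
        for (int i = 0; i < passes; i++) {
            long start = System.nanoTime();
            test.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
            total += elapsed;
        }
        return new Timing(best, total / passes);
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for one test on one document.
        Runnable test = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        Timing t = time(test, 10); // 10 passes, as in the results shown
        System.out.println("best " + t.bestNanos + " ns, average " + t.averageNanos + " ns");
    }
}
```

In the framework described above, a loop like this would sit inside two outer loops, one over the tests and one over the documents, with a separate JVM execution for each document model.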
Timing benchmarks using HotSpot and similar dynamically optimizing JVMs are notoriously tricky; small changes in the test sequence often cause large variations in timing results. I've found that this is especially true for average times executing a particular piece of code; the best times are much more consistent, and are the values I've shown in these results. You can see a comparison of average versus best times for the first test, Document build time.