XML and Java technologies: Data binding, Part 2: Performance
Part 1 provides background on why you'd want to use data binding for XML, along with an overview of the available Java frameworks for data binding. If you haven't already read Part 1, you'll probably want to at least glance over it now. In this part I'm going straight to the issue of performance without further discussion of the whys and hows!
Note that the airport name information in Listing 1 is usually a single line of code. To accommodate the column size, some lines of code are split and appear on two lines.
In addition to the compact format, I also tried a variation with more use of child elements for data values (only staying with attributes for IDs and IDREFs). Here's the same data presented in that format, which I refer to here as the full format:

Listing 2. Full document format
Often, the relative performance of XML frameworks differs greatly depending on the size of documents being used, so I included both large and small documents in these performance tests. The large documents (time-comp.xml and time-full.xml) use identical data values in the two different formats shown above. Because of this, the sizes are considerably different (106 KB for the compact format versus 211 KB for the full format). The small documents are in collections, each containing 34 documents ranging in size from 1.4 to 3.3 KB for the compact format (ttcomp) and from 2.2 to 5.8 KB for the full format (ttfull). As with the large documents, corresponding documents in the small document collections contain the same data values. The full set of documents used in the tests is available from the Downloads page (see Resources).
I would prefer to test with more document variations than just the two
formats used for these results. However, the amount of effort involved in
adding more documents for a data binding test is substantial because of
the need to provide W3C XML Schema (Schema) and Document Type Definition
(DTD) descriptions for code generation, along with mapping files and base
classes for the mapped versions. The two formats used here, with both
large and small document variations, should at least give a fairly
representative picture of how the data binding alternatives perform for
typical business documents. They probably allow the mapped binding
approaches to show better memory usage than would be typical of general
documents, though, because most of the data values in these documents can
be converted to primitive types. This results in a very compact internal
representation. With documents where most of the data values need to be
All test results were obtained using a 1.4GHz Athlon system with 256MB of DDR RAM, running RedHat Linux 7.2. I used Sun's JDK 1.4.1 for Linux in all tests. The specific versions of each data binding framework tested are as follows: JAXB Beta 1, Castor 0.9.4.1, JBind 1.0 Beta 12/07, Quick 4.3.1, and Zeus Beta 3.5 (JiBX is a special case -- see So what's JiBX? following the test results for details). All tests except JBind and JiBX used the Piccolo SAX2 parser, version 1.0.3. This is the fastest SAX2 parser I'm aware of, and generally meets or beats the speed of the XMLPull parser used for the JiBX tests (XPP3 version 1.1.2). JBind was unable to work with the Piccolo parser, so for testing JBind I used Xerces Java 2, version 2.2.0.
To compare data binding with alternative approaches, I also ran a timing test of the same files using just the SAX2 parser, and timing and memory tests using the dom4j document model (a performance leader among the document models, and one that allows different SAX2 parsers to be used for parsing input documents). For these tests, I used dom4j version 1.3.
I used the same basic framework for these timing and memory usage tests as in my earlier tests with document models (see the author's document model performance article in Resources). This benchmark framework first reads all documents into internal memory buffers, then times multiple passes of input and output operations on the documents. The test results shown in Input timings and Output timings are the best times over several passes. This should be representative of long-term performance in a server-type environment where the same code is executed repeatedly.
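The best-of-several-passes approach described above can be sketched roughly as follows. This is not the benchmark code used for these results. The class name `BestOfPasses`, the `DocumentTask` interface, and the method `bestTimeNanos` are hypothetical names chosen for illustration, and the sketch assumes a much newer JDK than the 1.4.1 release used in the tests (it relies on `System.nanoTime` and lambda syntax).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical sketch of a best-of-N timing loop; not the author's benchmark code.
class BestOfPasses {

    /** The operation being timed; it reads one document from an in-memory stream. */
    interface DocumentTask {
        void run(InputStream in) throws Exception;
    }

    /**
     * Runs the task over every buffered document for the given number of
     * passes and returns the fastest pass time in nanoseconds.
     */
    static long bestTimeNanos(byte[][] documents, int passes, DocumentTask task)
            throws Exception {
        if (passes < 1) {
            throw new IllegalArgumentException("passes must be at least 1");
        }
        long best = Long.MAX_VALUE;
        for (int pass = 0; pass < passes; pass++) {
            long start = System.nanoTime();
            for (byte[] doc : documents) {
                // Documents are already in memory, so file I/O is not timed.
                task.run(new ByteArrayInputStream(doc));
            }
            long elapsed = System.nanoTime() - start;
            // Keep only the fastest pass; early passes include JVM warm-up.
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        byte[][] docs = {
            "<a/>".getBytes("UTF-8"),
            "<b><c/></b>".getBytes("UTF-8")
        };
        // Stand-in task: just drain the stream instead of unmarshalling.
        long best = bestTimeNanos(docs, 10, in -> {
            while (in.read() != -1) {
                // consume bytes
            }
        });
        System.out.println("best pass: " + best + " ns");
    }
}
```

Taking the minimum rather than the average discards the passes dominated by classloading and JIT compilation, which is why the figures reflect long-running server behavior rather than startup cost.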
dom4j is able to construct its in-memory representation of the documents in less than twice the amount of time taken by the parser alone. The only data binding framework that beats this performance is JiBX. JAXB, Quick, and Zeus all turn in respectable performance figures compared to dom4j, but take nearly twice as long as JiBX overall. Castor is very slow by comparison, both with mapped bindings and with generated code.
JBind performs a full order of magnitude slower than most of the binding frameworks in these tests. A small part of this poor performance is due to the slower parser used for the JBind tests (because it failed to work with the parser used for the other tests). A larger part is probably due to JBind forcing document validation against the Schema on input, which can add considerable overhead. Most of the poor performance is probably attributable to the JBind framework itself, though, which uses a very indirect approach to binding (building on top of a DOM document model, in the current implementation).
All the tests except for JBind were run without full validation. Most of the data binding frameworks include a certain inherent level of validation (assuring, for instance, that the content model of elements is matched) just by their design. Most can also use validating parsers (such as Xerces Java 2) for full checking of documents on input, and some (including JAXB) can perform full validation of bound data in memory. Since the main concern in these tests was performance, I disabled optional validation wherever possible (including using both property file and unmarshaller/marshaller settings in Castor).
dom4j offers better performance than any of the data binding approaches in this area, beating JiBX by a smidgen and Zeus by not much more. The other data binding frameworks take about twice as long, with Quick the slowest of all (no pun intended, of course). There's not nearly as much variation here as in the input tests, though the fact that dom4j does better than any of the data binding frameworks suggests that they all still have room for improvement.
The differences here are much larger than in the time performance comparisons, and show a very different pattern. While dom4j performed well in the time measurements, in terms of memory usage it's much worse than any of the data binding frameworks (except for JBind, which builds on an internal document model equivalent to dom4j's representation). Compared to the best performers in this area, dom4j takes more than 10 times the memory to represent the same data.
The two mapped binding approaches use the same internal structure for
the bound data, so they show identical memory usage. This gives them a tie
for first place in the memory efficiency arena, turning in a performance
several times better than the data binding approaches using generated
code. This is partially because the mapped binding uses a very compact
representation for data values. The mapped binding converts most of them
Besides the more extensive use of primitive values in the mapped bindings, another reason for the greater memory efficiency of this approach is that generated code approaches usually add control information to the actual data present in each bound object. This control information pads the size of the objects, reducing one of the main benefits of data binding.
The data binding frameworks using generated code consume at least
several times the memory of the mapped bindings in these tests, but (with
the exception of JBind) are still much smaller than dom4j's document model
representation. This is no surprise -- a document model such as dom4j
needs to construct objects to represent every component of the document
(including the actual data text, along with structure components such as
elements and attributes), while the data bindings only need to hold the
actual data. Much of that data is still stored as
Zeus is the only data binding approach considered here that directly
stores all data as
Figure 7 shows the elapsed time from when the benchmark program starts executing until a round-trip operation on a single short document returns (unmarshalling to objects, then marshalling the objects back out to a document). The difference from the previous timing figures is that here most of the time is spent in classloading and native code generation by the JVM for the data binding framework code. By comparing these results with the earlier timing charts, you can see that this startup time is generally several times larger than the actual processing time for even a fairly large document. If you're only working with a few documents per execution of your program, this startup time is going to be a more significant factor than the best case times shown earlier.
The size of the jar files used by the data binding framework is one major influence on this startup time. JiBX is the smallest, with a total size of less than 60KB for the runtime and parser. JAXB, Castor, and JBind are the largest, weighing in at roughly 1MB each. The time is also affected by the initialization required for each framework. In the case of Castor with a mapped binding this includes processing the mapping definition file, and for JBind it includes processing the Schema definition for the document.
So what's JiBX?
JiBX actually originated from this series of articles. When I began looking at the available data binding frameworks I was surprised to see that they didn't perform all that well compared with document models such as dom4j. This was contrary to my expectations, since the data binding approach actually reduces the amount of document information kept in memory -- a document model holds on to everything, while a data binding only needs the actual data. I thought that an approach that works with less data should generally be faster than one that works with more.
In looking at how the existing data binding frameworks operate, I saw two aspects that didn't look good from a performance standpoint. The first was extensive use of reflection in many of the frameworks. Reflection is a way of accessing information about a Java language class at runtime. It can be used to access fields and methods in instances of a class, giving a way of dynamically hooking together classes at runtime without the need for any source code links between the classes. Reflection is a very powerful Java technology feature, but it suffers a performance disadvantage when compared to calling a method or accessing a field directly in compiled code.
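To make the distinction concrete, here is a minimal sketch contrasting a direct setter call with the same call made through reflection. The `Flight` class and its `number` property are hypothetical and stand in for any bound data class; this is not code taken from any of the frameworks tested.

```java
import java.lang.reflect.Method;

// Hypothetical illustration of direct versus reflective property access.
class ReflectionAccess {

    /** Hypothetical bound data class. */
    public static class Flight {
        private String number;

        public String getNumber() {
            return number;
        }

        public void setNumber(String number) {
            this.number = number;
        }
    }

    /** Direct call: resolved at compile time, requires a source code link to Flight. */
    static void setDirect(Flight flight, String value) {
        flight.setNumber(value);
    }

    /**
     * Reflective call: the setter is looked up by name at runtime, so this
     * code needs no compile-time knowledge of the target class.
     */
    static void setReflective(Object target, String property, String value)
            throws Exception {
        String setterName = "set"
                + Character.toUpperCase(property.charAt(0))
                + property.substring(1);
        Method setter = target.getClass().getMethod(setterName, String.class);
        setter.invoke(target, value);
    }

    public static void main(String[] args) throws Exception {
        Flight flight = new Flight();
        setDirect(flight, "AA100");
        System.out.println(flight.getNumber());
        setReflective(flight, "number", "BA200");
        System.out.println(flight.getNumber());
    }
}
```

The reflective version pays for a name lookup and an indirect invocation on every call (a framework could cache the `Method` object, but the invocation itself remains indirect), which is the overhead the paragraph above refers to.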
The second aspect I questioned was the use of a SAX2 parser for unmarshalling documents. SAX2 is a very useful standard for parsing XML, but its event driven approach is not well suited to data binding and similar applications. The problem here is that the code processing the SAX2 events needs to maintain state information for everything it processes, and this adds both complexity and overhead.
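A small sketch shows the kind of state tracking this involves. The handler below collects the text of every `name` element, and to do so it has to remember between callbacks whether it is currently inside such an element. The element name, the class `SaxStateSketch`, and the method `airportNames` are assumptions made for illustration; the sketch uses the JDK's default JAXP parser rather than Piccolo.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of state kept by a SAX2 handler; not framework code.
class SaxStateSketch {

    static class NameHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        // State carried across callbacks: non-null only while inside <name>.
        private StringBuilder text;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            if ("name".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver text in several chunks, so accumulate.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                names.add(text.toString());
                text = null;
            }
        }
    }

    /** Returns the text content of every <name> element in the document. */
    static List<String> airportNames(String xml) throws Exception {
        NameHandler handler = new NameHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<airports><airport><name>Boston</name></airport>"
                + "<airport><name>Seattle</name></airport></airports>";
        System.out.println(airportNames(xml));
    }
}
```

Even for one element type the handler needs a field to bridge three separate callbacks. A real binding has to track this for every nested structure, which is where the added complexity and overhead come from.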
I created the code that grew into JiBX to test some ways around these problematic aspects of the other data binding frameworks, and to experiment with extending the mapped binding approach beyond what's supported by Castor. Instead of reflection, JiBX uses byte code enhancement to add hooks into application code at project build time. Instead of SAX2, JiBX is based on a pull parser architecture (currently XMLPull). Rather than generating code from a DTD or Schema, JiBX works with a binding definition that associates user-supplied classes with XML structure.
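For a feel of the pull parsing style, here is a minimal sketch that extracts the text of every `name` element from a document. It is only an illustration of the general technique: it uses the StAX API (`javax.xml.stream`) that ships with current JDKs as a stand-in, not the XMLPull API that JiBX is built on, and it is not JiBX code. The element name, the class `PullSketch`, and the method `airportNames` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical pull parsing sketch using StAX as a stand-in for XMLPull.
class PullSketch {

    /** Returns the text content of every <name> element in the document. */
    static List<String> airportNames(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            // The application drives the parser, asking for each event in turn.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "name".equals(reader.getLocalName())) {
                    // Reads text up to the matching end tag in one call.
                    names.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<airports><airport><name>Boston</name></airport>"
                + "<airport><name>Seattle</name></airport></airports>";
        System.out.println(airportNames(xml));
    }
}
```

Because the application code asks for the next piece of the document when it is ready for it, the position in the code itself records where processing stands, and no handler fields are needed to carry state between callbacks.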
These techniques are not unique to JiBX. Byte code enhancement is used by many JDO (Java Data Objects) implementations for basically the same purpose as in JiBX (to add access hooks to existing compiled code). The original JAXB code (since discarded) was based on a pull parser architecture similar to XMLPull. The mapped approach to data binding is supported (although with some limitations) by both Castor and Quick. Even though the individual techniques aren't new, the combination of them still makes for a very interesting alternative to the other data binding frameworks.
I'll give a full rundown on JiBX in Part 3 of this article. JiBX is still at an early development stage. For the performance tests, I hand wrote the code that would normally be added through byte code enhancement and ran it using the then-current version of the JiBX runtime. As of this article going to publication, I'm still wrapping up the enhancement code, and there are a number of other features I'd love to see added. If you can't wait until Part 3 to find out more about JiBX, check Resources for a link to the JiBX site. You can even start contributing to the future development of JiBX, as well as making use of JiBX in your own applications.
JAXB still looks like a good choice for the code generation approach in the future (the beta license only allows evaluation use). The current reference implementation beta is both bulky in terms of jar size and somewhat inefficient in terms of memory usage, but here again you may see better performance in the future. As of this writing, the current version is still a beta, and even after it's released, commercial or open source projects may improve performance over the reference implementation. Since it will be a standard part of the J2EE platform, JAXB is definitely going to play an important role in working with XML and Java technologies.
The performance results also confirm the use of JBind, Quick, and Zeus
as most appropriate for applications with special requirements rather than
for general usage. JBind's XML Code approach can provide a great basis for
an application built around processing of an XML document, but the
performance of the current implementation is liable to be a problem. Quick
and Zeus offer code generation from DTDs, but as I mentioned in Part 1,
it's generally pretty easy to convert DTDs to Schemas. On the downside,
Quick seems overly complex to use and Zeus supports only
For mapped approaches to data binding, Castor has the advantage of a fairly stable implementation and substantial real-world usage. Quick can be used for this type of binding as well, but again seems complex to set up. JiBX is new and not yet in full usage, but offers excellent performance along with a high degree of flexibility.
If you haven't read Part 1, you may want to refer back to it to learn more about the features of these data binding frameworks. Part 1 also discusses the tradeoffs between code generation and mapped approaches to data binding. In Part 3, I'll present the new JiBX framework in more depth. This includes how JiBX maps Java objects to XML, along with the byte code enhancement process JiBX uses at build time to minimize runtime overhead. Check back for full details on this exciting approach to pumping up framework performance!