Parsing, indexing, and searching XML with Digester and Lucene
These open source projects can ease your XML-handling tasks

Level: Intermediate

Otis Gospodnetic (otis@apache.org)
Software Engineer, Wireless Generation, Inc.
3 June 2003

Java developers can use the SAX interface to parse XML documents, but this process is rather complex. Digester and Lucene, two open source projects from the Apache Foundation, cut down your development time for projects in which you manipulate XML. Lucene developer Otis Gospodnetic shows you how it's done, with example code that you can compile and run.

If you've ever wanted to parse XML documents but have found SAX just a little difficult, this article is for you. In this article, we examine how to use two open source tools from the Apache Jakarta project, Commons Digester and Lucene, to handle the parsing, indexing, and searching of XML documents. Digester parses the XML data, and Lucene handles indexing and searching. You'll first see how to use each tool on its own and then how to use them together, with sample code that you can compile and run.

About Digester and Lucene
Commons Digester is a subproject of the Commons project, which is one of the initiatives developed by the community of developers who create open source software under the Apache Jakarta umbrella. Digester offers a simple and high-level interface for the mapping of XML documents to Java objects. When Digester finds developer-defined patterns in XML, it will take developer-specified actions. Digester requires a few additional Java libraries, including an XML parser compatible with either SAX 2.0 or JAXP 1.1. Digester's home page, listed in the Resources section at the end of this article, provides a short list of the libraries that Digester needs.

Lucene is another Apache Jakarta project. Like Digester, it is a Java library and not a stand-alone application. Behind its simple indexing and search interface hides an elegant piece of software capable of handling many documents.

In the rest of this article, we use Digester to parse a simple XML file, then illustrate how Lucene creates indices. Then we marry the two tools to create a Lucene-generated index from our sample XML document, and finally use Lucene classes to search through that index.

Using Digester to parse XML
We use Digester to parse the simple XML document in Listing 1, which contains entries in an imaginary address book. To demonstrate handling of elements with and without attributes, I decided to make type an attribute of the <contact> element, while leaving all other elements without any attributes.

Listing 1. XML snippet of a fictitious address book

<?xml version='1.0' encoding='utf-8'?>
<address-book>
    <contact type="individual">
        <name>Zane Pasolini</name>
        <address>999 W. Prince St.</address>
        <city>New York</city>
        <province>NY</province>
        <postalcode>10013</postalcode>
        <country>USA</country>
        <telephone>1-212-345-6789</telephone>
    </contact>
    <contact type="business">
        <name>SAMOFIX d.o.o.</name>
        <address>Ilica 47-2</address>
        <city>Zagreb</city>
        <province></province>
        <postalcode>10000</postalcode>
        <country>Croatia</country>
        <telephone>385-1-123-4567</telephone>
    </contact>
</address-book>

Using Digester to parse the above XML document is very simple, as Listing 2 illustrates.

The most involved part of using Digester is centralized in the main() method. After creating an instance of Digester, we have to create rules for actions that are to be triggered when certain patterns are encountered in the XML document that we are parsing. We'll look in more detail at each Digester rule that we defined in Listing 2. Note that the order in which rules are passed to Digester matters a great deal.

The first rule tells Digester to create an instance of the AddressBookParser class when the pattern "address-book" is found. Because <address-book> is the first element in the address book XML file, this rule will be the first to be triggered when we use Digester with our XML file.


digester.addObjectCreate("address-book", AddressBookParser.class);

This next rule instructs Digester to create an instance of class Contact when it finds the <contact> child element under the <address-book> parent.


digester.addObjectCreate("address-book/contact", Contact.class);

In the following snippet, we set the type property of the Contact instance when Digester finds the type attribute of the <contact> element.


digester.addSetProperties("address-book/contact", "type", "type");

Our AddressBookParser class contains several rules that look similar to the one shown below. They instruct Digester to invoke the setName() method of the Contact class instance and use the value enclosed by <name> elements as the method parameter.


digester.addCallMethod("address-book/contact/name", "setName", 0);

Finally, this rule tells Digester to call the addContact() method when it finds the closing </contact> element.


digester.addSetNext("address-book/contact", "addContact");

Again, it's important that you consider the order in which the rules are passed to Digester. While we could change the order of the various addCallMethod() rules in our class and still have properly functioning code, switching the order of addObjectCreate() and addSetNext() would result in an error.
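
To make the ordering concern more concrete, here is a minimal sketch of roughly the same work done by hand with the SAX classes that ship with the JDK. It is not Digester's implementation, and the names RuleOrderSketch, Handler, and the two-field Contact stand-in are hypothetical. It only illustrates the idea that an object is pushed onto a stack when <contact> opens, and that the hand-off to the parent has to happen while that object is still on the stack.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stdlib-only sketch; it does not use or reproduce Digester itself.
class RuleOrderSketch {

    // Stand-in for the article's Contact bean, reduced to two properties.
    static class Contact {
        String type;
        String name;
    }

    static class Handler extends DefaultHandler {
        final List<Contact> addressBook = new ArrayList<>();      // plays the parent object's role
        private final List<String> path = new ArrayList<>();      // open elements, outermost first
        private final Deque<Contact> stack = new ArrayDeque<>();  // objects under construction
        private final StringBuilder body = new StringBuilder();   // text of the current element

        private String pattern() {
            return String.join("/", path);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            path.add(qName);
            body.setLength(0);
            if (pattern().equals("address-book/contact")) {
                Contact contact = new Contact();        // like addObjectCreate: create and push
                contact.type = attrs.getValue("type");  // like addSetProperties: attribute to property
                stack.push(contact);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            body.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String pattern = pattern();
            if (pattern.equals("address-book/contact/name")) {
                stack.peek().name = body.toString().trim();  // like addCallMethod with the body text
            } else if (pattern.equals("address-book/contact")) {
                addressBook.add(stack.peek());  // like addSetNext: needs the object still on the stack
                stack.pop();                    // only then discard the finished object
            }
            path.remove(path.size() - 1);
        }
    }

    static List<Contact> parse(String xml) {
        Handler handler = new Handler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse address book", e);
        }
        return handler.addressBook;
    }

    public static void main(String[] args) {
        String xml = "<address-book>"
            + "<contact type=\"individual\"><name>Zane Pasolini</name></contact>"
            + "<contact type=\"business\"><name>SAMOFIX d.o.o.</name></contact>"
            + "</address-book>";
        for (Contact c : parse(xml)) {
            System.out.println(c.type + ": " + c.name);
        }
    }
}
```

If the two statements in the closing-</contact> branch were swapped, the pop would remove the contact before it could be handed to the address book. That is the kind of mistake a wrong rule order can introduce.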

Using Lucene to index text
There are four fundamental Lucene classes for indexing text: IndexWriter, Analyzer, Document, and Field.

The IndexWriter class creates a new index and adds documents to an existing index.

Before text is indexed, it is passed through an Analyzer. Analyzer classes are in charge of extracting indexable tokens out of text to be indexed and eliminating the rest. Lucene comes with a few different Analyzer implementations. Some of them deal with skipping stop words (frequently used words that don't help distinguish one document from the other, such as a, an, the, in, and on), for instance, while others deal with converting all tokens to lowercase letters, so that searches are not case sensitive.
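
The following sketch illustrates these analysis steps with plain Java and no Lucene classes. The class AnalysisSketch, its method names, and its five-word stop list are assumptions made for illustration; real Analyzer implementations differ in their details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of analysis steps; not one of Lucene's Analyzer classes.
class AnalysisSketch {

    // The stop words named in the text above.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    // Splits on runs of whitespace and changes nothing else.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Lowercases every token, then drops the ones that are stop words.
    static List<String> lowercaseWithoutStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The office on 999 W. Prince St.";
        System.out.println(whitespaceTokens(text));
        System.out.println(lowercaseWithoutStopWords(text));
    }
}
```

Note that the first method keeps case and punctuation exactly as they appear in the input, so a later search has to use the same form of a word to find it.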

An index consists of a set of Documents, and each Document consists of one or more Fields. Each Field has a name and a value. You can think of a Document as a row in an RDBMS, and Fields as columns in that row.

Let's consider a simple scenario in which we add a single contact entry with all its fields to the index. Listing 3 shows how we could do it, using the classes we just described.

Listing 3. Lucene-based address book indexer

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;


/**
 * <code>AddressBookIndexer</code> class provides a simple
 * example of indexing with Lucene.  It creates a fresh
 * index called "address-book" in a temporary directory every
 * time it is invoked and adds a single document with a
 * few fields to it.
 */
public class AddressBookIndexer
{
    public static void main(String args[]) throws Exception
    {
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "address-book";
        Analyzer analyzer = new WhitespaceAnalyzer();
        boolean createFlag = true;

        IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);
        Document contactDocument  = new Document();
        contactDocument.add(Field.Text("type", "individual"));
        contactDocument.add(Field.Text("name", "Zane Pasolini"));
        contactDocument.add(Field.Text("address", "999 W. Prince St."));
        contactDocument.add(Field.Text("city", "New York"));
        contactDocument.add(Field.Text("province", "NY"));
        contactDocument.add(Field.Text("postalcode", "10013"));
        contactDocument.add(Field.Text("country", "USA"));
        contactDocument.add(Field.Text("telephone", "1-212-345-6789"));
        writer.addDocument(contactDocument);
        writer.close();
    }
}

What exactly is happening here? Lucene indices are stored in directories in the filesystem. Each index is contained within a single directory, and multiple indices cannot share a directory. The first parameter in IndexWriter's constructor specifies the directory where the index should be stored. The second parameter provides the implementation of Analyzer that should be used for preprocessing the text before it is indexed. The particular implementation of Analyzer that we are using here uses the whitespace character as the delimiter for tokenizing the input. The last parameter is a boolean flag that, when true, tells IndexWriter to create a new index in the specified directory, or to overwrite any existing index in that directory. A value of false instructs IndexWriter to add Documents to an existing index instead. We then create a blank Document, and add several Text Fields to it. After the Document is populated, we add it to the index through the instance of IndexWriter; finally, we close the index. Closing the IndexWriter is important, as doing so ensures that all index changes are flushed to the disk.

It is important to note that Lucene offers several types of Fields. In this example I used the Text Fields, because Lucene doesn't just index them, but also stores their original value verbatim in the index. This allows us to show all the contact fields when searching the index. To learn more about other types of Fields in Lucene, see the Resources section.

Marrying Digester and Lucene
Now that you know how to use each of these tools on its own, we can combine the two classes we've written. We'll use Digester to handle XML parsing, and Lucene to handle indexing. You can see the resulting DigesterMarriesLucene class in Listing 4.

Let's look at some selections from this class in more detail. Just as we did in the AddressBookIndexer class, we need to open the Lucene index for writing using IndexWriter; we do so here in Listing 5. We pass in the path to the index directory, the Analyzer to process all data being indexed, and a createFlag that is set to true, so that a fresh index is created in that directory and any existing index there is overwritten.

Listing 5. Opening the index for writing

String indexDir =
    System.getProperty("java.io.tmpdir", "tmp") +
    System.getProperty("file.separator") + "address-book";
Analyzer analyzer = new WhitespaceAnalyzer();
boolean createFlag = true;

// IndexWriter to use for adding contacts to the index
writer = new IndexWriter(indexDir, analyzer, createFlag);

The modified addContact(Contact) method shown in Listing 6 now creates a fresh instance of the Lucene Document every time it is called. After the Document is populated with data from the Contact instance that is passed into the method, it is added to the index through an instance of IndexWriter.

Listing 6. New addContact(Contact) method adds the document to the index

Document contactDocument  = new Document();
contactDocument.add(Field.Text("type", contact.getType()));
contactDocument.add(Field.Text("name", contact.getName()));
contactDocument.add(Field.Text("address", contact.getAddress()));
contactDocument.add(Field.Text("city", contact.getCity()));
contactDocument.add(Field.Text("province", contact.getProvince()));
contactDocument.add(Field.Text("postalcode", contact.getPostalcode()));
contactDocument.add(Field.Text("country", contact.getCountry()));
contactDocument.add(Field.Text("telephone", contact.getTelephone()));
writer.addDocument(contactDocument);

Finally, in Listing 7, at the end of the main() method, the index is optimized and closed to ensure that all Documents added to it are indeed written to the index on the disk.

Listing 7. Optimizing and closing the index

// optimize and close the index
writer.optimize();
writer.close();

Using Lucene to search text
Now that we can create a Lucene index from a document containing address book entries encoded in XML, all we need is the ability to search that index. Lucene's API for searching is as simple as the indexing API. In the class in Listing 8, we search the index we created with the DigesterMarriesLucene class. Here, we run a query that looks for all contacts that contain the keyword "Zane" in the field called name.

Listing 8. Searching the address book index created with the Lucene indexer

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import java.io.IOException;


/**
 * <code>AddressBookSearcher</code> class provides a simple
 * example of searching with Lucene.  It looks for an entry whose
 * 'name' field contains keyword 'Zane'.  The index being searched
 * is called "address-book", located in a temporary directory.
 */
public class AddressBookSearcher
{
    public static void main(String[] args) throws IOException
    {
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "address-book";
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Query query = new TermQuery(new Term("name", "Zane"));
        Hits hits = searcher.search(query);
        System.out.println("NUMBER OF MATCHING CONTACTS: " + hits.length());
        for (int i = 0; i < hits.length(); i++)
        {
            System.out.println("NAME: " + hits.doc(i).get("name"));
        }
    }
}

You can see that the IndexSearcher class is used for accessing an existing index. The argument passed to its constructor is the path to the directory where the index is stored. Lucene provides a few different query types, and TermQuery is the simplest of them. The query in the code above will find all listings that contain the term "Zane" in a field called name. The call to IndexSearcher's search(Query) method executes the search against the index and returns a collection of matching Documents in an instance of Hits.
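
To see why the term has to be spelled exactly as it was indexed, consider this toy in-memory index written with plain Java collections. It is only a sketch of the general idea of looking up a (field, term) pair, not how Lucene stores or searches its indices; the class TermLookupSketch and its methods are hypothetical. Like the whitespace-based analysis used in our indexer, it splits values on whitespace and leaves case untouched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy index; a conceptual sketch, not Lucene's data structures.
class TermLookupSketch {

    // Stored field values per document, so hits can be displayed.
    private final List<Map<String, String>> stored = new ArrayList<>();

    // "field:term" mapped to the ids of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Splits each field value on whitespace, keeps case, and records the terms.
    int addDocument(Map<String, String> fields) {
        int id = stored.size();
        stored.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + token,
                                             key -> new TreeSet<>()).add(id);
                }
            }
        }
        return id;
    }

    // Exact lookup of one term in one field; the term itself is not analyzed.
    List<Integer> search(String field, String term) {
        List<Integer> ids = new ArrayList<>();
        TreeSet<Integer> found = postings.get(field + ":" + term);
        if (found != null) {
            ids.addAll(found);
        }
        return ids;
    }

    String get(int id, String field) {
        return stored.get(id).get(field);
    }

    public static void main(String[] args) {
        TermLookupSketch index = new TermLookupSketch();

        Map<String, String> individual = new LinkedHashMap<>();
        individual.put("name", "Zane Pasolini");
        individual.put("city", "New York");
        index.addDocument(individual);

        Map<String, String> business = new LinkedHashMap<>();
        business.put("name", "SAMOFIX d.o.o.");
        business.put("city", "Zagreb");
        index.addDocument(business);

        for (int id : index.search("name", "Zane")) {
            System.out.println("NAME: " + index.get(id, "name"));
        }
        System.out.println("Matches for lowercase zane: " + index.search("name", "zane").size());
    }
}
```

In this sketch the lookup for "zane" finds nothing, because only "Zane" was recorded at indexing time. The same reasoning applies whenever the query term and the indexed tokens are produced by different rules.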

While this search example is very simple, note that Lucene offers a rich set of search-related features. For instance, you can use several different types of queries with Lucene: boolean queries, wild-card queries, phrase queries, and so on. Lucene also offers the ability to search multiple indices at once, as well as the ability to search indices located on remote computers. Another useful feature is Lucene's QueryParser, which supports a powerful and user-friendly query syntax. For more information about Lucene's query syntax, see the Resources section.

Conclusion
You should now have a good understanding of how to use Jakarta Commons Digester to parse XML documents and how to use Jakarta Lucene to index XML documents and search the resulting index. The approach described in this article should satisfy the simple XML indexing and searching needs of most developers. You should also take a look at the Sandbox subproject of Lucene, which includes examples of indexing XML documents using SAX 2 and DOM parsers. For more complex and generic solutions, visit Lucene's contributions page, a link to which is included in Resources.

Resources

About the author
Otis Gospodnetic is an active Apache Jakarta member, a developer of Lucene, and maintainer of jGuru's Lucene FAQ. His professional interests include Web crawlers, information gathering and retrieval, and distributed computing. Otis currently lives in New York City and can be reached at otis@apache.org.

