Parsing XHTML with dom4j and Groovy

J David Eisenberg

This article will show you how to parse XHTML with dom4j and Groovy. We will write a program that “repairs” an XHTML file by finding all the <img> elements that do not have an alt attribute and putting in a placeholder value for that attribute.

The XHTML File

Here’s the XHTML file that we will start with. Notice that it doesn’t have a <!DOCTYPE> yet; we’ll get to that issue later.

<html>
<head>
	<title>Various Cats</title>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<p>Tabitha was a great cat; I had her for twelve years.
	<img src="tabitha.jpg" alt="Black female DSH cat" /></p>
<p>Marco is five years old, but still has a lot of kitten in him. 
	<img src="marco.jpg" alt="Flame point siamese mix male DSH cat" /></p>
<p>Big Tony is not very bright, but he is still loveable.
	<img src="tony.jpg" /></p>
<p>Misha is the feral cat-in-residence at Evergreen Valley College in
	San Jose, California.
	<img src="misha.jpg" /></p>
<hr />
<textarea rows="5" cols="5"></textarea>
</body>
</html>

You may see the entire file as cats.html in the downloadable zip file.

Setting Up

Since the program uses dom4j, you must have the latest dom4j jar file on your classpath. Additionally, since the program uses dom4j’s XPath capability, you must also have Jaxen on your classpath.

The Code

Our first task is to read the file from standard input and parse it into a document. You supply the file by redirecting it to standard input when you run the script. The Groovy code to do this is as follows:

/* Import necessary libraries */
import org.dom4j.Document
import org.dom4j.Element
import org.dom4j.Attribute
import org.dom4j.io.SAXReader
import org.dom4j.io.HTMLWriter
import org.dom4j.xpath.DefaultXPath

SAXReader reader	// the reader will do the parsing
Document htmlDoc	// this will be the parsed document
DefaultXPath path	// XPath used to find the <img> elements
def imageNodes		// list of the <img> elements that the XPath finds
Element element		// variable for processing elements individually
HTMLWriter outWriter	// output writer (this will be standard output)

reader = new SAXReader( )
htmlDoc = reader.read( System.in )
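If you would rather pass the file name as a command-line argument than redirect standard input, a variant along these lines should work. This is only a sketch: parseInput is a hypothetical helper name that is not part of the downloadable code, and it assumes dom4j is on the classpath.

```groovy
import org.dom4j.Document
import org.dom4j.io.SAXReader

// Hypothetical helper: parse the file named by the first argument,
// or fall back to standard input when no argument is given.
Document parseInput( String[] args )
{
	SAXReader reader = new SAXReader( )
	if (args.length > 0)
	{
		// SAXReader can read directly from a java.io.File
		return reader.read( new File( args[0] ) )
	}
	// no argument: read standard input, as the main program does
	return reader.read( System.in )
}
```

In a Groovy script, the command-line arguments are available in the args variable, so the call would be htmlDoc = parseInput( args ).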

To select all the <img> elements that do not have an alt attribute, you need to create and execute this XPath expression:

path = new DefaultXPath( "body//img[not(@alt)]" )
imageNodes = path.selectNodes( htmlDoc.getRootElement() )

Note the two slashes in a row; this will find all <img> elements within the <body>, no matter how deeply they are nested in other elements. If the path had only one slash, then it would find only those <img> elements that were direct children of <body> (which, in the example file, would be none at all). The argument to selectNodes gives the context node at which to start; in this case, the root element of the document, which is the <html> element.
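To see the difference between the two forms of the path, you can try both on a small document. The following is a minimal sketch, assuming dom4j and Jaxen are on the classpath; the XHTML string is a made-up stand-in for cats.html, and it uses DocumentHelper.parseText to parse a string rather than a file.

```groovy
import org.dom4j.Document
import org.dom4j.DocumentHelper
import org.dom4j.xpath.DefaultXPath

// A tiny stand-in for cats.html (hypothetical content, not the real file)
String xhtml = '''<html><body>
<p><img src="a.jpg" alt="described" /></p>
<p><img src="b.jpg" /></p>
<div><p><img src="c.jpg" /></p></div>
</body></html>'''

Document doc = DocumentHelper.parseText( xhtml )
def root = doc.getRootElement()

// two slashes: <img> elements at any depth below <body>
def deep = new DefaultXPath( "body//img[not(@alt)]" ).selectNodes( root )

// one slash: only <img> elements that are children of <body>
def shallow = new DefaultXPath( "body/img[not(@alt)]" ).selectNodes( root )

println "body//img[not(@alt)] matched ${deep.size()} element(s)"
println "body/img[not(@alt)] matched ${shallow.size()} element(s)"
```

The first expression should match the two images that lack an alt attribute, while the second should match nothing, because every <img> here sits inside a <p>.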

The modification of the XHTML is a straightforward for loop that adds an alt attribute whose value is the word Placeholder: followed by the value of the src attribute. In this code, the attributeValue() method returns a String giving the attribute’s value, or null if there is no such attribute. This is different from the attribute() method, which returns an Attribute object.

for (i in 0..<imageNodes.size())
{
	element = imageNodes.get(i)
	element.addAttribute( "alt",
			"Placeholder: " + element.attributeValue( "src" ) )
}
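The difference between attributeValue() and attribute() is easy to check on a single element. This sketch assumes dom4j is on the classpath; the one-element document is invented for illustration, and the closure-based each call is just an alternative to the indexed loop, not the code from the zip file.

```groovy
import org.dom4j.DocumentHelper
import org.dom4j.Element

// A one-element document, made up for this example
Element img = DocumentHelper.parseText( '<img src="tony.jpg" />' ).getRootElement()

// attributeValue() hands back the value as a String
String srcValue = img.attributeValue( "src" )

// attribute() hands back an Attribute object; ask it for its value
def srcAttribute = img.attribute( "src" )
println "${srcValue} / ${srcAttribute.getValue()}"

// a missing attribute gives null from attributeValue()
def altBefore = img.attributeValue( "alt" )
println "alt before repair: ${altBefore}"

// the same repair as the loop, written with each and a closure
[img].each { element ->
	element.addAttribute( "alt",
			"Placeholder: " + element.attributeValue( "src" ) )
}
println "alt after repair: ${img.attributeValue( 'alt' )}"
```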

Producing Output

If this were an ordinary XML file, you would use dom4j’s XMLWriter class, but XHTML is different from ordinary XML in that it is being sent to browsers, which causes two problems with empty elements:

  1. If you don’t include a blank before the />, some older browsers will get confused and not interpret the element properly.

  2. Most XML writers will write all empty elements in the “short form.” This isn’t a problem with elements like <hr/> and <br/>, but you do not want this input:

    <textarea name="input" rows="5" cols="30"></textarea>

    to be written as this, which browsers will handle incorrectly:

    <textarea name="input" rows="5" cols="30" />

Luckily, dom4j provides the HTMLWriter class, which allows you to overcome these problems by setting the writer’s output format. The two settings shown here interact to give short form for only those HTML elements that are truly empty elements. Container elements will have both an opening and closing tag.

outWriter = new HTMLWriter( System.out )
outWriter.getOutputFormat().setXHTML(true)
outWriter.getOutputFormat().setExpandEmptyElements(true)

Finally, output the document. The cast is necessary because the write() method can take either a Document or a Node as an argument, and you need to make your intentions unambiguous.

outWriter.write( (Document) htmlDoc )
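To watch the two settings interact without involving standard output, you can write a small document into a string. This is a sketch under some assumptions: dom4j is on the classpath, the sample markup is invented, and the format is built as a separate OutputFormat object and passed to the HTMLWriter constructor rather than modified through getOutputFormat().

```groovy
import org.dom4j.Document
import org.dom4j.DocumentHelper
import org.dom4j.io.HTMLWriter
import org.dom4j.io.OutputFormat

// One truly empty element (<hr />) and one empty container (<textarea>)
Document doc = DocumentHelper.parseText(
	'<body><hr /><textarea rows="5" cols="30"></textarea></body>' )

// The same two settings as in the main program
OutputFormat format = new OutputFormat( )
format.setXHTML( true )
format.setExpandEmptyElements( true )

// Collect the output in memory instead of writing to System.out
StringWriter buffer = new StringWriter( )
HTMLWriter writer = new HTMLWriter( buffer, format )
writer.write( (Document) doc )
writer.flush( )

String result = buffer.toString( )
println result
```

In the printed result, the <textarea> should keep its separate closing tag while the <hr> stays in short form.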

You may see the entire file as modify_html.groovy in the zip file.

Real XHTML

The only problem with the XHTML file is that it isn’t true XHTML. If you try to run it through the W3C HTML Validator, it will complain bitterly that you are missing a <!DOCTYPE>. “No problem,” you think. You change the beginning of the file to look like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

Save the modified file as cats_ns.html (the ns stands for “namespace”) and run the code again. Much to your astonishment, the output hasn’t added the missing alt attributes. That’s because your XPath is looking for <body> and <img> elements that do not belong to any namespace at all. The addition of the attribute xmlns="http://www.w3.org/1999/xhtml" made that URI the default namespace for the <html> element and everything inside it, including <body> and <img>.

To fix this, you have to specify that the elements in your XPath expression have that same namespace URI. Change the XPath declaration as follows:

path = new DefaultXPath( "h:body//h:img[not(@alt)]" )
path.setNamespaceURIs( [h: "http://www.w3.org/1999/xhtml"] )

In this example, we arbitrarily chose h as the namespace prefix; it could be anything we like, but h is easy to type. Notice that there is no prefix on the @alt, because attributes in XHTML don’t have a namespace specifier. The setNamespaceURIs method takes a Map as its argument. The map contains one or more prefix/URI pairs, with each prefix as a key and its URI as the value. Once this change is made, everything works correctly again.
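You can confirm the effect of the namespace on a small document by running the unprefixed and prefixed expressions side by side. This is a minimal sketch, assuming dom4j and Jaxen are on the classpath; the XHTML string is a made-up fragment, not cats_ns.html itself.

```groovy
import org.dom4j.Document
import org.dom4j.DocumentHelper
import org.dom4j.xpath.DefaultXPath

// A made-up document that declares the XHTML default namespace
String xhtml = '''<html xmlns="http://www.w3.org/1999/xhtml">
<body><p><img src="tony.jpg" /></p></body>
</html>'''

Document doc = DocumentHelper.parseText( xhtml )
def root = doc.getRootElement()

// unprefixed names match only elements that are in no namespace
def plain = new DefaultXPath( "body//img[not(@alt)]" )

// prefixed names match elements in the namespace bound to h
def prefixed = new DefaultXPath( "h:body//h:img[not(@alt)]" )
prefixed.setNamespaceURIs( [h: "http://www.w3.org/1999/xhtml"] )

def plainHits = plain.selectNodes( root )
def prefixedHits = prefixed.selectNodes( root )

println "without prefix: ${plainHits.size()} match(es)"
println "with prefix:    ${prefixedHits.size()} match(es)"
```

Only the prefixed expression should find the image.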

Working Offline

There’s one other problem here: once the <!DOCTYPE> enters the scene, the parser will try to access the URI that the DOCTYPE specifies. The parser has to do this because that URI points to a Document Type Definition (DTD) that, among other things, lists all the available HTML entities. If you don’t believe this, change San Jose to San Jos&eacute; in the original cats.html file (see file cats_accented.html in the zip file). When you run the program, you’ll get this message:

The entity "eacute" was referenced, but not declared.

As long as you are connected to the Internet, things are fine. But if you try working offline, you will see the parser hang as it tries in vain to access http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd.

To solve this problem, we will make use of a catalog, which associates the PUBLIC identifier in the DOCTYPE with a local file that contains the DTD. You can get the appropriate catalog along with copies of the XHTML DTDs by downloading sgml-lib.tar.gz from the W3C XHTML validator site.

To make use of the catalog, you’ll need to download resolver.jar from XML commons and put it on your classpath. You must then add these import statements and declarations to your code:

import org.apache.xml.resolver.CatalogManager
import org.apache.xml.resolver.tools.CatalogResolver
CatalogManager cMgr
CatalogResolver cResolver

Now initialize the variables. The catalog manager normally gets its options from a file named CatalogManager.properties. Creating and modifying this file is easier than setting the properties programmatically, so put the following CatalogManager.properties file in one of your classpath’s directories.

# allow location to be relative to this file's directory
relative-catalogs=yes

# A semicolon-delimited list of catalog files.
# In this instance, we have a single catalog file, and it's a relative
# path name
catalogs=sgml-lib/xml.soc

# no debugging messages, please
verbosity=0

# Use the SYSTEM identifier 
prefer=system

And then initialize the resolver, and tell the parser to use the catalog resolver when it encounters the DTD:

cMgr = new CatalogManager( )	// reads CatalogManager.properties from the classpath
cResolver = new CatalogResolver( cMgr )
reader = new SAXReader( )
reader.setEntityResolver( cResolver )

When you run this revised code, you will see that it successfully finds and modifies the img elements. You will find this as file modify_html_ns.groovy in the zip file.
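For comparison, here is roughly what the programmatic route might look like if you prefer not to keep a properties file. Treat it as a sketch rather than the code in the zip file: it assumes dom4j and resolver.jar are on the classpath, that the setter calls mirror the four properties shown earlier, and that sgml-lib/xml.soc can be found relative to the directory you run the script from.

```groovy
import org.apache.xml.resolver.CatalogManager
import org.apache.xml.resolver.tools.CatalogResolver
import org.dom4j.io.SAXReader

CatalogManager cMgr = new CatalogManager( )

// There is no properties file in this variant, so don't warn about it
cMgr.setIgnoreMissingProperties( true )

// Counterparts of the entries in CatalogManager.properties
cMgr.setRelativeCatalogs( true )		// relative-catalogs=yes
cMgr.setCatalogFiles( "sgml-lib/xml.soc" )	// catalogs=sgml-lib/xml.soc
cMgr.setVerbosity( 0 )				// verbosity=0
cMgr.setPreferPublic( false )			// prefer=system

// Hand the configured manager to the resolver, and the resolver to the parser
CatalogResolver cResolver = new CatalogResolver( cMgr )
SAXReader reader = new SAXReader( )
reader.setEntityResolver( cResolver )
```

The properties file keeps these settings out of the code, which is why the article uses it; the setter calls are mainly useful when the catalog location is only known at run time.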

Summary

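When using dom4j with XHTML:

  1. Write the output with HTMLWriter, setting both setXHTML(true) and setExpandEmptyElements(true), so that only truly empty elements are written in short form.

  2. If the document declares the XHTML namespace, bind a prefix to that namespace URI and use the prefix on every element name in your XPath expressions.

  3. If the document has a <!DOCTYPE>, use a catalog resolver so that the parser can find the DTD in a local file instead of fetching it over the network.
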
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.