J David Eisenberg
This article will show you how to
parse XHTML with dom4j and
Groovy. We will write a program
that “repairs” an XHTML file by finding all the
<img> elements
that do not have an alt attribute and putting in a placeholder
value for that attribute.
Here’s the XHTML file that we will start with. Notice that it
doesn’t have a <!DOCTYPE> yet; we’ll
get to that issue later.
<html>
<head>
<title>Various Cats</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<p>Tabitha was a great cat; I had her for twelve years.
<img src="tabitha.jpg" alt="Black female DSH cat" /></p>
<p>Marco is five years old, but still has a lot of kitten in him.
<img src="marco.jpg" alt="Flame point siamese mix male DSH cat" /></p>
<p>Big Tony is not very bright, but he is still loveable.
<img src="tony.jpg" /></p>
<p>Misha is the feral cat-in-residence at Evergreen Valley College
in San Jose, California. <img src="misha.jpg" /></p>
<hr />
<textarea rows="5" cols="5"></textarea>
</body>
</html>
You may see the entire file as cats.html in the downloadable zip file.
Since the program uses dom4j, you must have the latest dom4j jar file on your classpath. Additionally, since the program uses dom4j’s XPath capability, you must also have Jaxen on your classpath.
Our first task is to read the file from standard input and parse it into a document. The file itself is supplied on the command line by redirecting it to the program’s standard input. The Groovy code to do this is as follows:
/* Import necessary libraries */
import org.dom4j.Document
import org.dom4j.Element
import org.dom4j.Attribute
import org.dom4j.io.SAXReader
import org.dom4j.io.HTMLWriter
import org.dom4j.xpath.DefaultXPath

SAXReader reader      // the reader will do the parsing
Document htmlDoc      // this will be the parsed document
DefaultXPath path     // XPath used to find the <img> elements
def imageNodes        // list of <img> elements that lack an alt attribute
HTMLWriter outWriter  // output writer (this will wrap standard output)

reader = new SAXReader( )
htmlDoc = reader.read( System.in )
To select all the <img> elements that do not
have an alt attribute, you need to create
and execute this XPath expression:
path = new DefaultXPath( "body//img[not(@alt)]" )
imageNodes = path.selectNodes( htmlDoc.getRootElement() )
Note the two slashes in a row; this will find all <img>
elements within the <body>, no matter how deeply they
are nested in other elements. If the path had only one slash, then it would
find only those <img> elements that were direct
children of <body> (which, in the example file, would
be none at all). The argument to selectNodes gives the
context node at which to start; in this case, the root element of the
document, which is the <html> element.
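To see the difference between the two forms, here is a minimal sketch that parses a small inline document instead of a file. It assumes dom4j and Jaxen are on the classpath as described above; the sample markup and the variable names are invented for illustration.

```groovy
import org.dom4j.DocumentHelper
import org.dom4j.xpath.DefaultXPath

// Hypothetical sample: one <img> directly under <body>, one nested in a <p>
def xml = '''<html><body>
  <img src="direct.jpg" />
  <p><img src="nested.jpg" /></p>
</body></html>'''

def root = DocumentHelper.parseText( xml ).getRootElement()

// One slash: only <img> elements that are children of <body>
def children = new DefaultXPath( "body/img[not(@alt)]" ).selectNodes( root )

// Two slashes: <img> elements at any depth below <body>
def descendants = new DefaultXPath( "body//img[not(@alt)]" ).selectNodes( root )

// prints "children: 1, descendants: 2"
println "children: ${children.size()}, descendants: ${descendants.size()}"
```

The double-slash form also matches the direct child, which is why it reports both images.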
The modification of the XHTML is a straightforward for loop,
adding an alt attribute whose value is the word
Placeholder: followed by the value of the src
attribute. In this code, the
attributeValue() method returns a String giving
the attribute’s value, or null if there is no such attribute.
This is different from the attribute() method, which returns
an Attribute object.
for (i in 0..<imageNodes.size())
{
element = imageNodes.get(i)
element.addAttribute( "alt",
"Placeholder: " + element.attributeValue( "src" ) )
}
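The difference between attributeValue() and attribute() is easiest to see on a small element. The sketch below is an illustration only: the sample element is invented, and the repair is written with Groovy's findAll and each, which is an alternative style rather than the form used in the downloadable program.

```groovy
import org.dom4j.DocumentHelper

// Hypothetical sample with one image that has alt text and one that lacks it
def root = DocumentHelper.parseText(
    '<p><img src="a.jpg" alt="A" /><img src="b.jpg" /></p>' ).getRootElement()

def images = root.elements( "img" )

// attributeValue() gives the String value, or null when the attribute is missing
assert images[0].attributeValue( "alt" ) == "A"
assert images[1].attributeValue( "alt" ) == null

// attribute() gives the Attribute object itself (also null when missing)
assert images[0].attribute( "alt" ).getValue() == "A"
assert images[1].attribute( "alt" ) == null

// The same repair as the for loop, applied only to images without alt
images.findAll { it.attribute( "alt" ) == null }.each { img ->
    img.addAttribute( "alt", "Placeholder: " + img.attributeValue( "src" ) )
}

println images[1].attributeValue( "alt" )   // prints "Placeholder: b.jpg"
```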
If this were an ordinary XML file, you would use dom4j’s
XMLWriter
class, but XHTML is different from ordinary XML in that it is being sent to
browsers, which causes two problems with empty elements:
If you don’t include a blank before the />,
some older browsers will get confused and not interpret the element
properly.
Most XML writers will write all empty elements in the
“short form.” This isn’t a problem with elements like
<hr/> and <br/>, but you do not
want this input:
<textarea name="input" rows="5" cols="30"></textarea>
to be written as this, which browsers will handle incorrectly:
<textarea name="input" rows="5" cols="30" />
Luckily, dom4j is provided with the HTMLWriter class, which
allows you to overcome these problems by setting the writer’s output
format. The two settings shown here interact to give short form for only those
HTML elements that are truly empty elements. Container elements will have
both an opening and closing tag.
outWriter = new HTMLWriter( System.out )
outWriter.getOutputFormat().setXHTML(true)
outWriter.getOutputFormat().setExpandEmptyElements(true)
Finally, output the document. The cast is necessary because the
write() method can take either a Document or a
Node as an argument, and you need to make your intentions
unambiguous.
outWriter.write( (Document) htmlDoc )
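To check how the two settings interact, you can write a small document to a string instead of standard output. This is a sketch under a few assumptions: it builds its own OutputFormat and passes it to the HTMLWriter constructor rather than changing the writer's default format, it suppresses the XML declaration to keep the output short, and the sample markup is invented.

```groovy
import org.dom4j.Document
import org.dom4j.DocumentHelper
import org.dom4j.io.HTMLWriter
import org.dom4j.io.OutputFormat

// Hypothetical sample: one truly empty element and one empty container
Document doc = DocumentHelper.parseText(
    '<body><hr/><textarea rows="5" cols="5"></textarea></body>' )

// The same two settings as in the article, on a format object of our own
OutputFormat format = new OutputFormat()
format.setSuppressDeclaration( true )
format.setXHTML( true )
format.setExpandEmptyElements( true )

// Write to a StringWriter so the result can be inspected
StringWriter buffer = new StringWriter()
HTMLWriter writer = new HTMLWriter( buffer, format )
writer.write( (Document) doc )
writer.flush()

String result = buffer.toString()
println result
```

In the printed result, the <textarea> should keep its separate closing tag while the <hr> stays in short form.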
You may see the entire file as modify_html.groovy in the zip file.
The only problem with the XHTML file is that it isn’t true XHTML.
If you try to run it through the
W3C HTML Validator, it will complain
bitterly that you are missing a <!DOCTYPE>. “No
problem,” you think. You change the beginning of the file to look
like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
Save the modified file as cats_ns.html (the ns
stands for “namespace”) and run the program again.
Much to your astonishment, the output hasn’t
added the missing alt attributes.
That’s because your XPath is looking for
<body> and <img> elements that
do not belong to any namespace at all. The addition of the
attribute xmlns="http://www.w3.org/1999/xhtml" made that
URI the default namespace for the <html> element and every element
inside it, including <body> and <img>.
To fix this, you have to specify that the elements in your XPath expression have that same namespace URI. Change the XPath declaration as follows:
path = new DefaultXPath( "h:body//h:img[not(@alt)]" )
path.setNamespaceURIs( [h: "http://www.w3.org/1999/xhtml"] )
In this example, we arbitrarily chose h as the namespace
prefix; it could be anything we like, but h is easier to
type. Notice that there is no
prefix on the @alt, because unprefixed attributes such as alt do not
belong to any namespace; the default namespace applies only to elements.
The setNamespaceURIs method takes a Map
as its argument. The map contains one or more prefix/URI pairs, with the
prefixes as keys and the URIs as values. Once this change is made, everything
works correctly again.
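The effect of the namespace is easy to reproduce on a small inline document. The following sketch is an illustration with invented sample markup; it runs the original expression and the prefixed expression against the same namespaced document and compares the number of matches.

```groovy
import org.dom4j.DocumentHelper
import org.dom4j.xpath.DefaultXPath

// Hypothetical sample in the XHTML namespace, with one <img> lacking alt
def root = DocumentHelper.parseText( '''
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p><img src="tony.jpg" /></p></body>
</html>'''.trim() ).getRootElement()

// No prefix: the path looks for elements in no namespace, so nothing matches
def plain = new DefaultXPath( "body//img[not(@alt)]" )
def plainHits = plain.selectNodes( root )

// With a prefix bound to the XHTML namespace URI, the element is found
def prefixed = new DefaultXPath( "h:body//h:img[not(@alt)]" )
prefixed.setNamespaceURIs( [h: "http://www.w3.org/1999/xhtml"] )
def prefixedHits = prefixed.selectNodes( root )

// prints "without prefix: 0, with prefix: 1"
println "without prefix: ${plainHits.size()}, with prefix: ${prefixedHits.size()}"
```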
There’s one other problem here: once the
<!DOCTYPE> enters the scene, the parser will try to
access the URI that the DOCTYPE specifies. The
parser has to do this because that URI points to a Document Type Definition
(DTD) that, among other things, lists all the available HTML entities.
If you don’t believe this,
change San Jose to San Jos&eacute;
in the original cats.html file (see
file cats_accented.html in the
zip file). When you run the
program, you’ll get this message:
The entity "eacute" was referenced, but not declared.
As long as you are connected to the Internet, things are fine. But if you try working offline, you will see the parser hang as it tries in vain to access http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd.
To solve this problem, we will make use of a catalog, which
associates the PUBLIC identifier in the DOCTYPE
with a local file that contains the DTD.
You can get the appropriate catalog along
with copies of the XHTML DTDs by downloading
sgml-lib.tar.gz
from the W3C XHTML validator site.
To make use of the catalog, you’ll need to download
resolver.jar from
XML commons and
put it on your classpath. You must then add these
import statements and declarations to your code:
import org.apache.xml.resolver.CatalogManager
import org.apache.xml.resolver.tools.CatalogResolver

CatalogManager cMgr
CatalogResolver cResolver
Now initialize the variables. The catalog manager normally gets its options from a file named CatalogManager.properties. Creating and modifying this file is easier than setting the properties programmatically, so put the following CatalogManager.properties file in one of your classpath’s directories.
# allow location to be relative to this file's directory
relative-catalogs=yes
# A semicolon-delimited list of catalog files.
# In this instance, we have a single catalog file, and it's a relative
# path name
catalogs=sgml-lib/xml.soc
# no debugging messages, please
verbosity=0
# Use the SYSTEM identifier
prefer=system
Then create the catalog manager and the resolver, and tell the parser to use the catalog resolver when it encounters the DTD:
cMgr = new CatalogManager( )
cResolver = new CatalogResolver( cMgr )
reader = new SAXReader( )
reader.setEntityResolver( cResolver )
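If the parser still tries to reach the network, it can help to ask the resolver directly what it would return for the DOCTYPE's identifiers. This is a hedged sketch, not part of the downloadable program: it assumes resolver.jar and your CatalogManager.properties are on the classpath, and the variable names are invented. The result depends entirely on your local catalog, so no output is shown.

```groovy
import org.apache.xml.resolver.CatalogManager
import org.apache.xml.resolver.tools.CatalogResolver

// The no-argument constructor reads CatalogManager.properties from the classpath
def manager = new CatalogManager()
def resolver = new CatalogResolver( manager )

// Identifiers copied from the <!DOCTYPE> used in cats_ns.html
def publicId = "-//W3C//DTD XHTML 1.0 Strict//EN"
def systemId = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"

// getResolvedEntity returns the mapped location, or null if nothing matches
def local = resolver.getResolvedEntity( publicId, systemId )

println( local ? "Catalog maps the DTD to ${local}"
               : "No catalog entry matched these identifiers" )
```

A null result usually means the properties file was not found on the classpath or the catalogs path does not point at the unpacked sgml-lib directory.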
When you run this revised code, you will see that it successfully finds
and modifies the img elements.
You will find this as file modify_html_ns.groovy
in the zip file.
When using dom4j with XHTML:
Do your output via HTMLWriter; it takes care of
the messy problems of empty elements.
Use real XHTML: provide a
<!DOCTYPE>,
set the xmlns attribute in your document,
and use setNamespaceURIs to make your
XPath expressions namespace-aware.
Ensure that you can process files even if you aren’t online by using a catalog with local copies of the XHTML DTDs.
This
work is licensed under a
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.