A couple of days ago I tried to modify a html page using a DOM. When I tried to convert the page into a Document instance the parser threw some SAXExceptions, complaining about the structure of the document, sort of "this tag needs to be closed" and the like.
The source was html output generated from Docbook. I neither had the time nor the intent to mess around cleaning up the generated html, but could remember there was something called Tidy.
So I searched for a Java library, and there it was.
JTidy, looking like an unmaintained project, but being the right tool to clean up a html page and transform it into valid xhtml.
The API is pretty straight forward.
This is the implementation for converting (non-valid) html to a Document instance:
// Create instance
final Tidy tidy = new Tidy();
// Remove presentational clutter (don't really know
// what exactly that does, but sounds great ;-)
tidy.setMakeClean( true );
// Use XHTML output
tidy.setXHTML( true );
// Make document readable by indenting the elements
tidy.setSmartIndent( true );
// The html document received by a get request
final String s = ...;
// Converting the page into a Document instance
final Document document = tidy.parseDOM( new ByteArrayInputStream( s.getBytes() ) , null );
That's it, by now you have your html as a Document instance that you can freely manipulate.
The only thing I noticed was that the method node.setTextContent() does not work. But you can use node.appendChild( document.createTextNode( ... ) ), that does what you want.
The second part is about writing your Document to a string:
// Create a stream to write the output to
final ByteArrayOutputStream outStr2 = new ByteArrayOutputStream();
// Write modified Document to an output stream
tidy.pprint( document , outStr2 );
// Create a StringBuilder
final StringBuilder builder = new StringBuilder();
// Write output stream content to string builder
builder.append( new String( outStr2.toByteArray() , "UTF-8" ) );
// Create String
final String validXHTML = builder.toString();
At the end of the block you have your valid XHTML in a String.