The source was HTML output generated from DocBook. I had neither the time nor the inclination to clean up the generated HTML by hand, but I remembered there was a tool called Tidy.
So I searched for a Java library and found JTidy. The project looks unmaintained, but it is the right tool for cleaning up an HTML page and transforming it into valid XHTML.
The API is pretty straightforward.
This is the implementation for converting (invalid) HTML to a Document instance:
// Create instance
final Tidy tidy = new Tidy();
// Remove presentational clutter: surplus presentational tags and
// attributes (such as <font>) are replaced by style rules
tidy.setMakeClean( true );
// Use XHTML output
tidy.setXHTML( true );
// Make document readable by indenting the elements
tidy.setSmartIndent( true );
// The HTML document received by a GET request
final String s = ...;
// Converting the page into a Document instance
final Document document = tidy.parseDOM( new ByteArrayInputStream( s.getBytes() ) , null );
That's it. You now have your HTML as a Document instance that you can freely manipulate.
The only problem I noticed was that the method node.setTextContent() does not work. Instead, you can use node.appendChild( document.createTextNode( ... ) ), which does what you want.
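Note that appendChild() adds the text after whatever the node already contains, whereas setTextContent() would replace it. If you need the replacing behaviour, a small helper can remove the existing children first. The following is only a sketch: the class and method names are made up for illustration, and it assumes that JTidy's nodes support the basic DOM Level 1 methods used here (hasChildNodes, getFirstChild, removeChild). The demo document comes from the JDK's own parser so the snippet runs without JTidy on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of JTidy
final class DomTextHelper {

    // Replace all children of the node with a single text node,
    // using only DOM Level 1 methods
    static void setText( final Document document , final Node node , final String text ) {
        while ( node.hasChildNodes() ) {
            node.removeChild( node.getFirstChild() );
        }
        node.appendChild( document.createTextNode( text ) );
    }

    // Stand-in document built with the JDK parser (for the demo only);
    // in the real code the Document would come from tidy.parseDOM(...)
    static Document newDemoDocument() {
        try {
            final Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            final Element p = doc.createElement( "p" );
            p.appendChild( doc.createTextNode( "old text" ) );
            doc.appendChild( p );
            return doc;
        } catch ( final ParserConfigurationException e ) {
            throw new IllegalStateException( e );
        }
    }
}
```

With the Document from the first part, the call would look like DomTextHelper.setText( document , node , "new text" ).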
The second part is about writing your Document to a string:
// Create a stream to write the output to
final ByteArrayOutputStream outStr2 = new ByteArrayOutputStream();
// Write modified Document to an output stream
tidy.pprint( document , outStr2 );
// Decode the output stream content as UTF-8 and create the String
final String validXHTML = new String( outStr2.toByteArray() , "UTF-8" );
At the end of the block you have your valid XHTML in a String.
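One thing to watch in both parts is the character encoding. s.getBytes() without an argument uses the platform default charset, while the output is decoded as UTF-8, so non-ASCII characters may get mangled on some machines. A minimal sketch of keeping both directions explicit is shown below. The class and method names are hypothetical, and it covers only the plain Java side: JTidy has its own input and output encoding settings, which I assume need to match as well (check the Javadoc of your JTidy version).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JTidy
final class Utf8Streams {

    // Encode the page explicitly as UTF-8 instead of relying on
    // the platform default charset
    static ByteArrayInputStream toStream( final String s ) {
        return new ByteArrayInputStream( s.getBytes( StandardCharsets.UTF_8 ) );
    }

    // Decode the written bytes with the same charset
    static String toText( final ByteArrayOutputStream out ) {
        return new String( out.toByteArray() , StandardCharsets.UTF_8 );
    }
}
```

In the snippets above, this would mean passing Utf8Streams.toStream( s ) to tidy.parseDOM(...) and calling Utf8Streams.toText( outStr2 ) after tidy.pprint(...).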