I am happy to announce that the much anticipated W3C Connector, after lots of coding, testing, bug fixing and several months of successful usage in a commercial product was proven worthy to be released to the public. If everything goes well, it will hit the streets next week.
OK, but what the heck is the W3C Connector?
The W3C Connector is a Java package which can be used to access the Mozilla DOM tree from Java, while implementing the standard org.w3c.* interfaces. This means you can use it with any standard Java package that is expecting org.w3c.* interfaces ( Xerces, Saxon, Jaxen, … ) to execute effective queries on the Mozilla DOM (XML/XSLT/XPath/XQuery operations for example).
Technically,the W3C Connector is an implementation of the standard org.w3c.* interfaces. The implementing classes are calling Javier Pedemonte’s JavaXPCOM package, which in turn wraps the Mozilla XPCOM in order to gain access to the Mozilla DOM. See the image for an illustration:
This is very nice and all, but why should I care about it?
If you ever wanted to do (or have done) a screen scraping application, where you needed to understand the underlying document to some extent (regular expressions were not sufficient) you should know that there are many pitfalls along the way:
- First of all, malformed HTML code: Despite the continuous efforts of the W3C and other organizations/individuals to remedy this problem by promoting X(HT)ML and other machine parsable formats, a lots of web pages still have malformed code in them. Based on the level of non-standardness, parsing such a page can be more than a moderate technical problem: in practice there are pages which can not be parsed to produce an usable input.
- You can not use a standard query language like XPath or XQuery – these languages require a XML input, which you can not ensure because of the previous point, so you are left to roll your own code to process the parsed data.
Of course this is not a big problem for a crafted programmer, mainly if he is equipped with tools like HTMLTidy to address the first point, RubyfulSoup or similar to tackle the second. However, even these (and other) tools and a cool programming language are still just easing up the pain of effective screen scraping, but not offering a generic solution. If you want to scrap a lot and different pages, these problems in practice will cripple your efforts (or at least make it last very long time in practice).
How does the W3C Connector solve this problem?
By solving both points: The Mozilla DOM is a structure reflecting how gecko (the mozilla rendering engine) renders the page, and it always translates to valid XML (no unclosed tags or otherwise malformed code), and because of implementing the org.w3c.* interfaces you can use very robust and effective XPath packages (like Saxon) to query the document for effective HTML extraction.
There are of course a lot of other possible uses – the connector is not a tool itself, but a gateway to the world of W3C compliant XML tools – it is up to you how to leverage the power it gives you.
The Big Brother
The W3C Connector will be released officially as the part of theATF project. The code is under the last review at the moment, it is possible that I will come out with a preview release before the official one.