W3C Mozilla DOM Connector was released today

I announced the upcoming release of the W3C Mozilla DOM Connector in one of my previous posts, and now it has finally arrived. You can view it at

http://svn.rubyrailways.com/W3CConnector/

or check it out with svn:

svn co http://svn.rubyrailways.com/W3CConnector/

For a description of the connector, please refer to my previous post. If you would like to try it out, here is how:

// This code snippet gives you a W3C DOM Document of the currently loaded page:
  nsIWebBrowser brow = getWebBrowser();
  nsIWebNavigation nav = (nsIWebNavigation)
      brow.queryInterface(nsIWebNavigation.NS_IWEBNAVIGATION_IID);
  nsIDOMDocument doc = (nsIDOMDocument) nav.getDocument();
  Document mozDoc = (Document)
      org.mozilla.dom.NodeFactory.getNodeInstance(doc);

From now on, you can use all the existing Java DOM libraries on Mozilla documents: an XPath 2 engine like Saxon, Xalan, or whatever you want.
This means tremendous power compared to tools like RubyfulSoup or Mechanize (outstanding in their category, but still limited), stemming from the power of XPath to query XML documents.
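For instance, a generic query helper built on the javax.xml.xpath API that ships with the JDK could look like the sketch below; the method name is just an illustration, and any org.w3c.dom-aware engine (Saxon, Jaxen, Xalan) could be substituted:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustrative helper: evaluates an XPath expression against the
// Mozilla-backed Document returned by the connector and returns
// the matching nodes.
public static NodeList query(Document mozDoc, String expression)
        throws XPathExpressionException
{
    XPath xpath = XPathFactory.newInstance().newXPath();
    return (NodeList) xpath.evaluate(expression, mozDoc, XPathConstants.NODESET);
}

Called with the mozDoc obtained above, query(mozDoc, "//title") would return the title element of the current page, for example (bearing in mind that XPath is case sensitive, so the element test has to match the casing the HTML DOM actually reports).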
Another simple example dumps the DOM of the HTML document to stdout:

import java.io.IOException;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Node;

// Serializes the given DOM node to stdout using an identity transformation.
public static void writeDOM(Node n)
      throws IOException
  {
      try {
          StreamResult sr = new StreamResult(System.out);
          TransformerFactory trf = TransformerFactory.newInstance();
          Transformer tr = trf.newTransformer();
          tr.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
          tr.transform(new DOMSource(n), sr);
      }
      catch (TransformerException e) {
          throw new IOException("Could not serialize DOM: " + e.getMessage());
      }
   }
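Tying the two snippets together, dumping the currently loaded page is then a one-liner (assuming the mozDoc obtained in the first snippet):

// Dump the DOM of the currently loaded Mozilla page to stdout:
writeDOM(mozDoc);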

Cool, isn’t it?
At the moment, I am discussing various integration issues with the Mozilla guys, since the connector should become part of Mozilla and the Eclipse editor in the future.

W3C Mozilla DOM Connector will be released soon!

I am happy to announce that the much anticipated W3C Connector, after lots of coding, testing, bug fixing and several months of successful use in a commercial product, has proven worthy of a public release. If everything goes well, it will hit the streets next week.

OK, but what the heck is the W3C Connector?

The W3C Connector is a Java package which can be used to access the Mozilla DOM tree from Java while implementing the standard org.w3c.* interfaces. This means you can use it with any standard Java package that expects org.w3c.* interfaces (Xerces, Saxon, Jaxen, ...) to execute effective queries on the Mozilla DOM (XML/XSLT/XPath/XQuery operations, for example).
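As a quick illustration, running an XSLT stylesheet over a Document returned by the connector with the standard JAXP API could look roughly like this sketch; the stylesheet and output file names are made-up placeholders:

import java.io.File;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Document;

// Applies an XSLT stylesheet to the Mozilla-backed Document and writes
// the transformation result to a file.
public static void transform(Document mozDoc) throws TransformerException
{
    TransformerFactory factory = TransformerFactory.newInstance();
    Transformer transformer =
            factory.newTransformer(new StreamSource(new File("extract.xsl")));
    transformer.transform(new DOMSource(mozDoc),
            new StreamResult(new File("extracted.xml")));
}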

Technically, the W3C Connector is an implementation of the standard org.w3c.* interfaces. The implementing classes call Javier Pedemonte's JavaXPCOM package, which in turn wraps Mozilla's XPCOM in order to gain access to the Mozilla DOM. See the image for an illustration:

This is very nice and all, but why should I care about it?

If you have ever wanted to write (or have written) a screen scraping application where you needed to understand the underlying document to some extent (i.e. regular expressions were not sufficient), you should know that there are many pitfalls along the way:

  • First of all, malformed HTML code: despite the continuous efforts of the W3C and other organizations/individuals to remedy this problem by promoting X(HT)ML and other machine-parsable formats, a lot of web pages still contain malformed code. Depending on the level of non-compliance, parsing such a page can be more than a moderate technical problem: in practice there are pages which cannot be parsed into usable input.
  • You cannot use a standard query language like XPath or XQuery: these languages require XML input, which you cannot guarantee because of the previous point, so you are left to roll your own code to process the parsed data.

Of course this is not a big problem for a skilled programmer, especially if he is equipped with tools like HTMLTidy to address the first point and RubyfulSoup or similar to tackle the second. However, even these (and other) tools and a cool programming language merely ease the pain of effective screen scraping; they do not offer a generic solution. If you want to scrape a lot of different pages, these problems will in practice cripple your efforts (or at least make them take a very long time).

How does the W3C Connector solve this problem?

By solving both points: the Mozilla DOM is a structure reflecting how Gecko (the Mozilla rendering engine) renders the page, and it always translates to valid XML (no unclosed tags or otherwise malformed code); and because the connector implements the org.w3c.* interfaces, you can use very robust and effective XPath packages (like Saxon) to query the document for effective HTML extraction.
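To make the HTML extraction point concrete, here is a minimal sketch that collects every link target from a page; it uses the JDK's javax.xml.xpath API instead of Saxon just to keep the example short, and the method name is illustrative only:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Prints the href attribute of every anchor element in the document.
// Note: XPath is case sensitive, so depending on how the HTML DOM reports
// element names you may need "//A/@href" instead of "//a/@href".
public static void printLinks(Document mozDoc) throws XPathExpressionException
{
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList hrefs = (NodeList)
            xpath.evaluate("//a/@href", mozDoc, XPathConstants.NODESET);
    for (int i = 0; i < hrefs.getLength(); i++) {
        System.out.println(hrefs.item(i).getNodeValue());
    }
}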

There are of course a lot of other possible uses: the connector is not a tool in itself but a gateway to the world of W3C compliant XML tools, and it is up to you how to leverage the power it gives you.

The Big Brother

The W3C Connector will be released officially as part of the ATF project. The code is under final review at the moment; it is possible that I will put out a preview release before the official one.