W3C Mozilla DOM Connector will be released soon!

I am happy to announce that the much-anticipated W3C Connector, after lots of coding, testing, bug fixing and several months of successful use in a commercial product, has proven worthy of a public release. If everything goes well, it will hit the streets next week.

OK, but what the heck is the W3C Connector?

The W3C Connector is a Java package for accessing the Mozilla DOM tree from Java through the standard org.w3c.* interfaces. This means you can use it with any standard Java package that expects org.w3c.* interfaces (Xerces, Saxon, Jaxen, …) to run queries on the Mozilla DOM (XML/XSLT/XPath/XQuery operations, for example).
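To make the "standard interfaces" point concrete, here is a minimal sketch of code that depends only on org.w3c.dom types. The class name LinkLister and its methods are made up for illustration, and the document is built with the JDK's own XML parser purely as a stand-in, because the connector's entry point is not shown in this post. The assumption is that a Document backed by the Mozilla DOM could be passed to linkTargets in exactly the same way.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkLister {
    // Depends only on org.w3c.dom interfaces, so it does not care
    // which implementation produced the Document.
    static List<String> linkTargets(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                hrefs.add(anchor.getAttribute("href"));
            }
        }
        return hrefs;
    }

    // Stand-in document source (assumption): the JDK parser, not the connector.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse stand-in document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body>"
                + "<a href=\"/one\">first</a>"
                + "<a name=\"anchor-only\">no href here</a>"
                + "<a href=\"/two\">second</a>"
                + "</body></html>");
        System.out.println(linkTargets(doc)); // prints [/one, /two]
    }
}
```

The only thing that would change with a different DOM implementation is where the Document comes from; linkTargets itself stays untouched.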

Technically, the W3C Connector is an implementation of the standard org.w3c.* interfaces. The implementing classes call Javier Pedemonte’s JavaXPCOM package, which in turn wraps Mozilla XPCOM to gain access to the Mozilla DOM. See the image for an illustration:

This is very nice and all, but why should I care about it?

If you have ever written (or wanted to write) a screen scraping application that needed to understand the underlying document to some extent (where regular expressions were not sufficient), you know that there are many pitfalls along the way:

  • First of all, malformed HTML code: Despite the continuous efforts of the W3C and other organizations/individuals to remedy this problem by promoting X(HT)ML and other machine-parsable formats, a lot of web pages still contain malformed code. Depending on how far a page strays from the standard, parsing it can be more than a moderate technical problem: in practice there are pages which cannot be parsed into usable input at all.
  • You cannot use a standard query language like XPath or XQuery. These languages require XML input, which you cannot guarantee because of the previous point, so you are left to roll your own code to process the parsed data.
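A small sketch of the first pitfall, using only the XML parser that ships with the JDK (the class and method names are hypothetical): markup that a browser renders without complaint is rejected outright by a strict XML parser, so there is no tree left to run XPath against.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictParseCheck {
    // Returns true only if the markup is well-formed XML.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violation, e.g. an unclosed or mismatched tag.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O trouble, unrelated to the markup.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed: every tag is closed in the right order.
        System.out.println(parsesAsXml("<p>Hello <b>world</b></p>")); // prints true
        // Typical tag soup: <b> is never closed.
        System.out.println(parsesAsXml("<p>Hello <b>world</p>"));     // prints false
        // Unclosed <li> elements, common in hand-written HTML.
        System.out.println(parsesAsXml("<ul><li>one<li>two</ul>"));   // prints false
    }
}
```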

Of course this is not a big problem for a skilled programmer, especially one equipped with tools like HTMLTidy to address the first point and RubyfulSoup or similar to tackle the second. However, even these (and other) tools and a cool programming language only ease the pain of effective screen scraping; they do not offer a generic solution. If you want to scrape many different pages, these problems will cripple your efforts (or at least make them take a very long time).

How does the W3C Connector solve this problem?

By solving both points: the Mozilla DOM is a structure reflecting how Gecko (the Mozilla rendering engine) renders the page, and it always translates to well-formed XML (no unclosed tags or otherwise malformed code). And because the connector implements the org.w3c.* interfaces, you can use very robust XPath packages (like Saxon) to query the document and extract the HTML you need.
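Here is a rough sketch of what such a query looks like once you hold an org.w3c.dom.Document. It uses the javax.xml.xpath API bundled with the JDK rather than Saxon, and the document is again a stand-in parsed from a string, since the connector's own API is not shown in this post. PriceExtractor, select, the table markup and the "price" class are all invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PriceExtractor {
    // Evaluates an XPath expression against any org.w3c.dom.Document and
    // returns the trimmed text content of every matching node.
    static List<String> select(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    // Stand-in document source (assumption): the JDK parser, not the connector.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><table>"
                + "<tr><td class=\"name\">Widget</td><td class=\"price\"> 9.99 </td></tr>"
                + "<tr><td class=\"name\">Gadget</td><td class=\"price\">19.50</td></tr>"
                + "</table></body></html>");
        System.out.println(select(doc, "//td[@class='price']")); // prints [9.99, 19.50]
    }
}
```

One expression replaces the hand-rolled tree walking you would otherwise write for every page layout, which is the whole appeal of getting a well-formed DOM behind the standard interfaces.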

There are of course a lot of other possible uses – the connector is not a tool itself, but a gateway to the world of W3C compliant XML tools – it is up to you how to leverage the power it gives you.

The Big Brother

The W3C Connector will be officially released as part of the ATF project. The code is in its final review at the moment, and I may come out with a preview release before the official one.
