
How To Convert An HTML Source Of A Webpage Into org.w3c.dom.Document In Java?

How do I convert the HTML source of a webpage into an org.w3c.dom.Document in Java?

Solution 1:

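I suggest the Validator.nu HTML Parser (http://about.validator.nu/htmlparser/), which implements the HTML5 parsing algorithm. Its nu.validator.htmlparser.dom.HtmlDocumentBuilder class extends javax.xml.parsers.DocumentBuilder, so its parse methods return an org.w3c.dom.Document. Firefox is in the process of replacing its own HTML parser with this one.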

Solution 2:

I have just been playing with jsoup, a fantastic Java HTML parser that works a little like jQuery and is really easy to use. Its org.jsoup.helper.W3CDom class can convert a parsed jsoup document into an org.w3c.dom.Document.

Solution 3:

That's actually fairly difficult to do robustly, because arbitrary web pages are sometimes malformed and the major browsers are fairly tolerant of that. You may want to look into the Swing HTML parser (javax.swing.text.html.parser), which I've never tried but which looks like it may be the best option. You could also try something along the lines of the following and handle any parsing exceptions that come up, although I've only ever tried this with XML:

import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

...

try {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
    // parse() throws SAXException if the markup is not well-formed XML
    Document doc = docBuilder.parse(InputStreamYouBuiltEarlierFromAnHTTPRequest);
}
catch (ParserConfigurationException e)
{
    ...
}
catch (SAXException e)
{
    ...
}
catch (IOException e)
{
    ...
}
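
To make the limits of the XML-parser approach concrete, here is a minimal, self-contained sketch. It assumes the page source is already in memory as a String rather than coming from an HTTP request. The class name XhtmlDomSketch and the helper isParseable are hypothetical names made up for this illustration. It uses only the JDK's own parser, so it will only accept markup that happens to be well-formed XML (for example XHTML); typical tag-soup HTML is rejected with a SAXException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration; not part of the original answer.
class XhtmlDomSketch {

    /** Parses markup with the JDK's XML parser; only well-formed XML/XHTML is accepted. */
    static Document parse(String markup)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Assumption: the source is UTF-8 text held in a String.
        try (InputStream in = new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8))) {
            return builder.parse(in);
        }
    }

    /** Returns true if the markup parses, false if the XML parser rejects it. */
    static boolean isParseable(String markup) {
        try {
            parse(markup);
            return true;
        } catch (SAXException e) {
            // Not well-formed XML, which is common for real-world HTML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>Hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getTagName()); // prints "html"

        // An unclosed <br> is fine for a browser but fatal for an XML parser.
        System.out.println(isParseable("<html><body><p>Hello<br></body></html>")); // prints "false"
    }
}
```

The second call shows why this only goes so far: a single unclosed tag is enough to make the parse fail, so for pages you don't control, an HTML-aware parser is the safer choice.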

...
