Extract Tags From an HTML File Using Jsoup
I am doing a structural analysis of web documents. For this I need to extract only the structure of a web document (only the tags). I found an HTML parser for Java called Jsoup. But
Solution 1:
Sounds like a depth-first traversal:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDepthFirst {

    private static String htmlTags(Document doc) {
        StringBuilder sb = new StringBuilder();
        htmlTags(doc.children(), sb);
        return sb.toString();
    }

    // Appends each tag name on entry and again on exit, comma-separated.
    private static void htmlTags(Elements elements, StringBuilder sb) {
        for (Element el : elements) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(el.nodeName());
            htmlTags(el.children(), sb);
            sb.append(",").append(el.nodeName());
        }
    }

    public static void main(String... args) {
        String s = "<html><head>this is head </head><body>this is body</body></html>";
        Document doc = Jsoup.parse(s);
        System.out.println(htmlTags(doc));
    }
}
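The recursion above uses one call-stack frame per nesting level. For very deeply nested documents, the same depth-first order can be produced with an explicit stack instead. The following is only a minimal sketch of that idea: to keep it runnable without jsoup, it walks a hypothetical stand-in class Tag (a name plus child tags) rather than jsoup's Element, and the names IterativeTagWalk, Frame and tagSequence are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class IterativeTagWalk {

    // Hypothetical stand-in for a parsed element: a tag name plus child elements.
    static final class Tag {
        final String name;
        final List<Tag> children;

        Tag(String name, Tag... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // One stack entry: an open tag and an iterator over its unvisited children.
    private static final class Frame {
        final Tag tag;
        final Iterator<Tag> rest;

        Frame(Tag tag) {
            this.tag = tag;
            this.rest = tag.children.iterator();
        }
    }

    // Emits each tag name on entry and again on exit, comma-separated.
    static String tagSequence(Tag root) {
        StringBuilder sb = new StringBuilder(root.name);
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root));
        while (!stack.isEmpty()) {
            Frame top = stack.peek();
            if (top.rest.hasNext()) {
                Tag child = top.rest.next();
                sb.append(',').append(child.name);   // entering a child
                stack.push(new Frame(child));
            } else {
                sb.append(',').append(top.tag.name); // leaving the current tag
                stack.pop();
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Tag doc = new Tag("html", new Tag("head"), new Tag("body", new Tag("p")));
        System.out.println(tagSequence(doc)); // html,head,head,body,p,p,body,html
    }
}
```

With jsoup, the same loop could presumably iterate over el.children() in place of the stand-in's children list.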
Another solution is to use jsoup's NodeVisitor, as follows:
SecondSolution ss = new SecondSolution();
doc.traverse(ss);
System.out.println(ss.sb.toString());
The visitor class:
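// Declare inside the enclosing class; needs imports for
// org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.nodes.Node and org.jsoup.select.NodeVisitor.
public static class SecondSolution implements NodeVisitor {

    StringBuilder sb = new StringBuilder();

    // Called when a node is first reached, before its children.
    @Override
    public void head(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(node.nodeName());
        }
    }

    // Called after all of the node's children have been visited.
    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            sb.append(",").append(node.nodeName());
        }
    }
}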
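The depth argument passed to head and tail is not used above. If an indented outline is more useful for structural analysis than a flat comma-separated list, depth could drive the indentation. The sketch below illustrates that under some assumptions: it does not use jsoup, and Tag, Visitor, traverse and outline are hypothetical stand-ins that only mirror the head/tail callback shape.

```java
import java.util.Arrays;
import java.util.List;

class OutlineSketch {

    // Hypothetical stand-in for an element node: a tag name plus child elements.
    static final class Tag {
        final String name;
        final List<Tag> children;

        Tag(String name, Tag... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Hypothetical callback pair mirroring the head/tail shape of the visitor.
    interface Visitor {
        void head(Tag tag, int depth);
        void tail(Tag tag, int depth);
    }

    // Depth-first walk: head before the children, tail after them.
    static void traverse(Tag tag, int depth, Visitor visitor) {
        visitor.head(tag, depth);
        for (Tag child : tag.children) {
            traverse(child, depth + 1, visitor);
        }
        visitor.tail(tag, depth);
    }

    // Builds an outline: one tag per line, indented two spaces per level.
    static String outline(Tag root) {
        final StringBuilder sb = new StringBuilder();
        traverse(root, 0, new Visitor() {
            @Override
            public void head(Tag tag, int depth) {
                for (int i = 0; i < depth; i++) {
                    sb.append("  ");
                }
                sb.append(tag.name).append('\n');
            }

            @Override
            public void tail(Tag tag, int depth) {
                // Nothing to emit on exit: the indentation already encodes nesting.
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        Tag doc = new Tag("html", new Tag("head"), new Tag("body", new Tag("p")));
        System.out.print(outline(doc));
    }
}
```

In a real NodeVisitor the same indentation loop would go inside head, keeping the instanceof checks so that text nodes and the Document itself are skipped.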