Should I Use HTML::Parser Or XML::Parser To Extract And Replace Text?
Solution 1:
HTML::Parser takes a token-and-callback approach. I find it very convenient when the data you wish to extract or change is subject to particularly complex conditions on its surrounding context.
Otherwise I prefer a tree-based approach. HTML::TreeBuilder::XPath (ultimately built on HTML::Parser) lets you find nodes with XPath. It returns HTML::Element objects. The documentation is a little sparse (or rather, spread over a couple of modules), but it is still the quickest way to dig into HTML.
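As a rough illustration of the tree-based approach, here is a minimal sketch that pulls link text out of a fragment with an XPath query. The sub name `tag_names` and the sample markup are made up for this example; only `parse_content`, `findnodes`, `as_text` and `delete` come from the modules themselves.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: return the text of every <a> directly inside
# a <div class="tags"> element.
sub tag_names {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);

    # In list context findnodes returns the matching HTML::Element objects.
    my @names = map { $_->as_text } $tree->findnodes('//div[@class="tags"]/a');

    $tree->delete;    # release the tree explicitly
    return @names;
}

# Sample markup invented for the example.
my $sample = '<div class="tags"><a href="/questions/tagged/php">php</a> '
           . '<a href="/questions/tagged/perl">perl</a></div>';
print join(", ", tag_names($sample)), "\n";
```

The predicate matches the `class` attribute exactly; for a multi-valued attribute such as `class="tags t-php"` you would probably want `contains(@class, "tags")` instead.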
If you are dealing with pure XML, XML::Twig is an outstanding parser: it manages memory very well and lets you combine the tree and stream approaches. The documentation is very good, too.
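To give a feel for XML::Twig, the following sketch registers a handler that fires once each `<title>` element has been parsed and rewrites its text. The element names and the sub name `fix_titles` are assumptions made for the example, not anything from the question.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: replace PERL with Perl in every <title> element
# and return the resulting document as a string.
sub fix_titles {
    my ($xml) = @_;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called as soon as a <title> element is completely parsed.
            title => sub {
                my ($t, $elt) = @_;
                (my $text = $elt->text) =~ s/\bPERL\b/Perl/g;
                $elt->set_text($text);
            },
        },
    );
    $twig->parse($xml);
    return $twig->sprint;
}

# Sample document invented for the example.
print fix_titles('<posts><post><title>comparing PERL md5() and PHP md5()</title></post></posts>'), "\n";
```

This version keeps the whole document in memory. For large files, a handler can call `$t->flush` or `$t->purge` once it has finished with a subtree, which is where the stream side of the module comes in.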
Solution 2:
Say in someone's StackOverflow user page you want to replace all instances of PERL with Perl. You could do so with
#! /usr/bin/perl

use warnings;
use strict;

use HTML::Parser;
use LWP::Simple;

my $html = get "http://stackoverflow.com/users/201469/phil-jackson";
die "$0: get failed" unless defined $html;

# Called for every start and end tag: $skipped is the text passed over
# since the previous handler call, $markup is the tag itself.
sub replace_text {
  my($skipped,$markup) = @_;

  $skipped =~ s/\bPERL\b/Perl/g;
  print $skipped, $markup;
}

my $p = HTML::Parser->new(
  api_version     => 3,
  marked_sections => 1,
  case_sensitive  => 1,
  unbroken_text   => 1,
  xml_mode        => 1,
  start_h => [ \&replace_text => "skipped_text, text" ],
  end_h   => [ \&replace_text => "skipped_text, text" ],
);

# your page may use a different encoding
binmode STDOUT, ":utf8" or die "$0: binmode: $!";
$p->parse($html);
$p->eof;  # signal end of document to the parser
The output is what we expect:
$ wget -O phil-jackson.html http://stackoverflow.com/users/201469
$ ./replace-text >out.html
$ diff -ub phil-jackson.html out.html
--- phil-jackson.html
+++ out.html
@@ -327,7 +327,7 @@
 PERL:  
-#$linkTrue =  … ">comparing PERL md5() and PHP md5()</a></h3>
+#$linkTrue =  … ">comparing Perl md5() and PHP md5()</a></h3>
         <div class="tags t-php t-perl t-md5">
             <a href="/questions/tagged/php" class="post-tag" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/perl" class="post-tag" title="show questions tagged 'perl'" rel="tag">perl</a> <a href="/questions/tagged/md5" class="post-tag" title="show questions tagged 'md5'" rel="tag">md5</a>
The "PERL:" sore thumb is part of an element attribute, not a text section.
Solution 3:
You should also look at Web::Scraper. I find this module easier to use than HTML::Parser and its relatives, but it helps if you are familiar with XPath. How well HTML parsing works depends heavily on the actual pages: HTML, like PDF, is oriented toward display rather than data.
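For comparison, a Web::Scraper version of a simple extraction might look like the sketch below. The `a.post-tag` selector is borrowed from the diff output shown earlier; the result key `tags` and the sub name `tag_list` are arbitrary names chosen for the example.

```perl
use strict;
use warnings;
use Web::Scraper;

# Declare what to collect: the text of every <a class="post-tag">,
# gathered into an array under the (arbitrary) key "tags".
my $tags = scraper {
    process 'a.post-tag', 'tags[]' => 'TEXT';
};

# Hypothetical helper: scrape an HTML string and return the tag names.
sub tag_list {
    my ($html) = @_;
    my $res = $tags->scrape($html);
    return @{ $res->{tags} || [] };
}

# Sample markup modelled on the page fragment above.
my $sample = '<div><a href="/questions/tagged/php" class="post-tag">php</a> '
           . '<a href="/questions/tagged/md5" class="post-tag">md5</a></div>';
print join(", ", tag_list($sample)), "\n";
```

The selector here is CSS, but `process` also accepts an XPath expression, which is why some familiarity with XPath pays off.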
Solution 4:
Which module you should use depends on what you are trying to do. For starters, HTML::Parser comes with great examples which also include a script that extracts plain text from an HTML document.
Do not try to parse HTML documents using an XML parser: You will find yourself in a world of pain as a lot of valid HTML constructs are not valid XML.
Do not try to parse XML documents using an HTML parser: You will lose all the advantages of the stricter requirement that an XML document be well formed before it can be parsed.
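A small sketch shows why the two do not mix. It relies on XML::Parser dying as soon as it meets input that is not well formed; the sub name `is_well_formed` and the two sample strings are made up for the example.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: true if $doc is well-formed XML.
# XML::Parser's parse method dies on the first well-formedness error.
sub is_well_formed {
    my ($doc) = @_;
    return eval { XML::Parser->new->parse($doc); 1 } ? 1 : 0;
}

# XML-style empty element: accepted.
print is_well_formed('<p>one<br/>two</p>') ? "ok\n" : "rejected\n";

# Unclosed <br>, perfectly ordinary in HTML: rejected by the XML parser.
print is_well_formed('<p>one<br>two</p>') ? "ok\n" : "rejected\n";
```

An HTML parser would accept both strings without complaint, which is exactly the strictness you give up by running XML through it.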