[Israel.pm] HTML Tables Parsing with Perl
Offer Kaye
oferk at oren.co.il
Thu May 27 05:20:40 PDT 2004
>
> Hi,
>
> I know this sounds simple, and of course I didn't try to re-invent the
> wheel BUT I used HTML::TableContentParser...
> Which doesn't really work well for me :)
>
Hi Yuval,
Since HTML::TableContentParser's README states that it doesn't support
nested tables, this is obviously a feature, not a bug :-)
Using Mikhael's code would work, up until the time (for example) some
malicious bastard put a TD tag inside an HTML comment...
So unless you have a really strong aversion to depending on external modules,
I suggest using them instead of custom regexes.
One module which would work would be HTML::TokeParser :
http://search.cpan.org/dist/HTML-Parser/lib/HTML/TokeParser.pm
For example, the following code simply prints every TD content. If you run
this you'll note it "does the right thing"(tm) regarding the nested table:
############## >>> CODE START
use warnings;
use strict;
use HTML::TokeParser;
# Grab the HTML:
my $page_content = q(
<html>
<head><title>kuku</title></head>
<body>
<h1> My stuff </h1>
<TABLE border="5"><TR>
<TD>
text
b/w
<TABLE border="2">
<TR>
<TD>more text</TD><TD>yet more text</TD>
</TR>
</TABLE>
</TD>
<td>foo</td><td>bar</td>
</TR>
<tr>
<td>2nd row</td>
</tr>
</TABLE>
</body>
</html>
);
# Parse the HTML
my $p = HTML::TokeParser->new(\$page_content);
# Find each table cell (td tag) and print the text up to the next tag.
while (defined $p->get_tag("td") ){
print $p->get_text,"\n";
}
############## >>> CODE END
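If you also want to know which cells belong to the nested table, a small
variation is to ask get_tag for the table tags as well and keep a depth
counter. This is only a sketch: the $depth and @cells names and the
shortened HTML sample are mine, and it assumes every TABLE tag is properly
closed.

```perl
use warnings;
use strict;
use HTML::TokeParser;

# Shortened sample (an assumption for illustration): one table nested in another.
my $html = q(<table><tr><td>outer <table><tr><td>inner</td></tr></table></td><td>foo</td></tr></table>);

my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";

my $depth = 0;   # how many <table> tags are currently open
my @cells;       # list of [depth, text] pairs, one per td

# get_tag accepts several tag names and returns the first one it meets.
while (my $tag = $p->get_tag("table", "/table", "td")) {
    my $name = $tag->[0];
    if    ($name eq "table")  { $depth++ }
    elsif ($name eq "/table") { $depth-- }
    else {
        # get_trimmed_text returns the text up to the next tag,
        # with surrounding whitespace removed.
        push @cells, [ $depth, $p->get_trimmed_text ];
    }
}

printf "depth %d: %s\n", @$_ for @cells;
```

Cells of the outer table come out at depth 1 and cells of the nested one
at depth 2, so you can filter on that number.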
If you want a solution that better preserves the table structure, this
module might be better:
http://search.cpan.org/dist/HTML-TableExtract/
Although I personally haven't used it...
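Going by its documentation, a minimal use of HTML::TableExtract might look
like the sketch below. Treat it as a starting point, not tested advice: the
HTML sample is a made-up shortened one, and I am assuming that a parser
created without any constraints matches every table in the document.

```perl
use warnings;
use strict;
use HTML::TableExtract;

# Shortened sample (an assumption for illustration): one table nested in another.
my $html = q(<table><tr><td>text <table><tr><td>more text</td><td>yet more text</td></tr></table></td><td>foo</td><td>bar</td></tr><tr><td>2nd row</td></tr></table>);

# No headers/depth/count constraints given.
my $te = HTML::TableExtract->new();
$te->parse($html);

for my $ts ($te->tables) {
    # coords gives the (depth, count) position of this table.
    print "Table at (", join(",", $ts->coords), "):\n";
    for my $row ($ts->rows) {
        # Cells can be undef, and may carry stray whitespace.
        my @cells = map { defined $_ ? $_ : "" } @$row;
        s/^\s+|\s+$//g for @cells;
        print "  ", join(" | ", @cells), "\n";
    }
}
```

Each row comes back as an array reference, so the row and column structure
of the table is kept.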
Regards,
----------------------------------
Offer Kaye