[Israel.pm] HTML Tables Parsing with Perl

Offer Kaye oferk at oren.co.il
Thu May 27 05:20:40 PDT 2004


>
> Hi,
>
> I know this sounds simple, and of course I didn't try to re-invent the
> wheel BUT I used HTML::TableContentParser...
> Which doesn't really work well for me :)
>

Hi Yuval,
Since HTML::TableContentParser's README states that it doesn't support
nested tables, this is obviously a feature, not a bug :-)
Using Mikhael's code would work, right up until (for example) some
malicious bastard puts a TD tag inside an HTML comment...
So unless you have a really strong aversion to depending on external modules,
I suggest using them instead of custom regexes.
One module that would work is HTML::TokeParser:
http://search.cpan.org/dist/HTML-Parser/lib/HTML/TokeParser.pm
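For example, the following code simply prints the text of every TD cell
(get_text returns the text up to the next tag, so the outer cell prints only
its own text and the nested cells are printed separately). If you run
this you'll note it "does the right thing"(tm) regarding the nested table: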
############## >>> CODE START

use warnings;
use strict;
use HTML::TokeParser;

# Grab the HTML:
my $page_content = q(
<html>
<head><title>kuku</title></head>
<body>
<h1> My stuff </h1>
<TABLE border="5"><TR>
<TD>
    text
    b/w
    <TABLE border="2">
        <TR>
          <TD>more text</TD><TD>yet more text</TD>
       </TR>
    </TABLE>
</TD>
<td>foo</td><td>bar</td>
</TR>
<tr>
<td>2nd row</td>
</tr>
</TABLE>

</body>
</html>

);

# Parse the HTML
my $p = HTML::TokeParser->new(\$page_content);

# Find each table cell (td start tag) and print the text that follows it.
while (defined $p->get_tag("td") ){
   print $p->get_text,"\n";
}

############## >>> CODE END
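
To make the earlier point about comments concrete, here is a minimal sketch
comparing a hand-rolled regex with HTML::TokeParser. The input string and the
regex are assumptions made up for illustration (this is not Mikhael's actual
code, which isn't shown here); the parser calls are the same ones used above,
plus get_trimmed_text to strip surrounding whitespace.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input: one real cell, plus a TD hidden inside an HTML comment.
my $html = q(<table><tr><!-- <td>commented out</td> --><td>real cell</td></tr></table>);

# Naive regex (an assumed example of a hand-rolled approach):
# it cannot tell that the first TD sits inside a comment.
my @regex_cells = $html =~ m{<td[^>]*>(.*?)</td>}gis;

# Tokenizing parser: the whole comment is a single token,
# so get_tag("td") never sees the TD inside it.
my $p = HTML::TokeParser->new(\$html);
my @parser_cells;
while (defined $p->get_tag("td")) {
    push @parser_cells, $p->get_trimmed_text;
}

print "regex:  ", join(' | ', @regex_cells), "\n";
print "parser: ", join(' | ', @parser_cells), "\n";
```

The regex picks up both "commented out" and "real cell", while the parser
returns only "real cell".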

If you want a solution that better preserves the table structure, this
module might be better:
http://search.cpan.org/dist/HTML-TableExtract/
Although I personally haven't used it...
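
For what it's worth, a minimal sketch of how that might look, based on the
module's documented interface rather than on real-world use. It assumes that
constructing the extractor without headers, depth or count constraints matches
every table, and that cells can come back undefined. The HTML is a trimmed
copy of the sample above.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Trimmed copy of the sample page: an outer table with one nested table.
my $html = q(
<TABLE border="5"><TR>
<TD>text b/w
    <TABLE border="2">
        <TR><TD>more text</TD><TD>yet more text</TD></TR>
    </TABLE>
</TD>
<td>foo</td><td>bar</td>
</TR>
<tr><td>2nd row</td></tr>
</TABLE>
);

# Assumption: with no headers/depth/count constraints, every table matches.
my $te = HTML::TableExtract->new();
$te->parse($html);

foreach my $ts ($te->tables) {
    # coords returns (depth, count): nesting level and position at that level.
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        # Treat undefined cells as empty and squeeze whitespace before printing.
        my @cells = map {
            my $c = defined $_ ? $_ : '';
            $c =~ s/\s+/ /g;
            $c =~ s/^ | $//g;
            $c;
        } @$row;
        print "  ", join(' | ', @cells), "\n";
    }
}
```

Each table is reported with its own coordinates, so the nested table shows up
as a separate table one level deeper instead of being flattened into a single
stream of cells.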

Regards,
----------------------------------
Offer Kaye



