[Israel.pm] HTML Tables Parsing with Perl

Shlomi Fish shlomif at iglu.org.il
Thu May 27 04:22:35 PDT 2004


On Thursday 27 May 2004 12:04, Yuval Yaari wrote:
> Hi,
>
> I know this sounds simple, and of course I didn't try to re-invent the
> wheel BUT I used HTML::TableContentParser...
> Which doesn't really work well for me :)
>
> Basically, I need to extract all the data from a <TD> ...
> But if there's a table inside that <TD>, HTML::TableContentParser fails.
>
> Basically, I need:
> <TD>                 <---- From here
>     text
>     b/w
>     <TABLE>
>         <TR>
>           <TD>text</TD>
>        </TR>
>     </TABLE>
> </TD>                 <---- All the way to here, excluding the </TD>...
>
> So as you see, I can't be 100% sure that there won't be any <TABLE>s
> inside that <TD> (though I do want them...).
> There may also be <TD>'s before/after the specific <TD> I'm looking for,
> so I wasn't able to write a regex.
>
> Any modules, scripts, regexes (???) would be highly appreciated.
>

Check out this script, which I wrote using HTML::TokeParser. It prints each
<td> of the first table in the file in turn, including the markup of any
tables nested inside it.

<<<<<<<<<<<<<<<<<<
#!/usr/bin/perl -w

use strict;

use HTML::TokeParser;

sub get_element_text
{
    my $elem = shift;
    if ($elem->[0] eq "T")
    {
        return $elem->[1];
    }
    else
    {
        return $elem->[-1];
    }
}

sub is_start_tag
{
    my $token = shift;
    my $type = shift;

    return (($token->[0] eq "S") && (lc($token->[1]) eq $type));
}

sub is_end_tag
{
    my $token = shift;
    my $type = shift;

    return (($token->[0] eq "E") && (lc($token->[1]) eq $type));
}

my $filename = shift || "test.html";

my $p = HTML::TokeParser->new($filename);
if (! $p)
{
    die "Cannot open '$filename': $!";
}

while (my $token = $p->get_token())
{
    # This handles the first <table> in the file. Change this
    # condition to match the real table that we wish to process.
    if (is_start_tag($token, "table"))
    {
        my $td_idx = 0;
        TABLE_LOOP: while ($token = $p->get_token())
        {
            if (is_start_tag($token, "td"))
            {
                # Collect tokens until the </td> that closes this cell.
                my $num_nested_tables = 0;
                my $text = "";
                TD_LOOP: while ($token = $p->get_token())
                {
                    if (is_start_tag($token, "table"))
                    {
                        $num_nested_tables++;
                    }
                    elsif (is_end_tag($token, "table"))
                    {
                        $num_nested_tables--;
                    }
                    elsif (is_end_tag($token, "td") &&
                           ($num_nested_tables == 0))
                    {
                        last TD_LOOP;
                    }
                    $text .= get_element_text($token);
                }
                print "TD No. ${td_idx}:\n";
                print "--------------\n";
                print "$text\n";
                print "--------------\n\n";
                $td_idx++;
            }
            elsif (is_end_tag($token, "table"))
            {
                last TABLE_LOOP;
            }
        }
        # Stop after the first table.
        exit(0);
    }
}
>>>>>>>>>>>>>>>>>>

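If you would rather get the cells back as a list than have them printed, the same nesting-counter idea can be wrapped in a function. What follows is only a sketch under a few assumptions: extract_top_level_cells is a made-up name, the HTML is passed in as a string (HTML::TokeParser accepts a reference to a scalar as the document), and only the first <table> is looked at, as in the script above.

```perl
use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper: returns the raw inner HTML of every top-level
# <td> of the first <table> found in $html.
sub extract_top_level_cells
{
    my $html = shift;

    # A scalar reference tells HTML::TokeParser to parse the string itself.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    # Skip everything up to the first <table>.
    $p->get_tag("table") or return ();

    my @cells;
    my $depth = 0;    # number of nested tables we are currently inside
    my $text;         # undef while we are outside a top-level <td>

    while (my $token = $p->get_token())
    {
        my ($type, $tag) = @{$token}[0, 1];

        if (!defined($text))
        {
            # Outside a cell: wait for a <td> or the end of the table.
            if ($type eq "S" && $tag eq "td")
            {
                $text = "";
            }
            elsif ($type eq "E" && $tag eq "table")
            {
                last;
            }
            next;
        }

        # Inside a cell: count nested tables so that only the matching
        # </td> closes the cell.
        if ($type eq "S" && $tag eq "table")
        {
            $depth++;
        }
        elsif ($type eq "E" && $tag eq "table")
        {
            $depth--;
        }
        elsif ($type eq "E" && $tag eq "td" && $depth == 0)
        {
            push @cells, $text;
            undef $text;
            next;
        }

        # Text tokens keep their text at index 1, all others at the end.
        $text .= ($type eq "T") ? $token->[1] : $token->[-1];
    }

    return @cells;
}

# Sample input modelled on the <TD> from the question.
my $html = <<'EOF';
<table><tr>
<td>before</td>
<td>text b/w <table><tr><td>text</td></tr></table></td>
<td>after</td>
</tr></table>
EOF

my @cells = extract_top_level_cells($html);
print "TD No. $_:\n$cells[$_]\n\n" for 0 .. $#cells;
```

For the sample input this should yield three cells, the second of which carries the whole nested table as markup. Picking the one <td> you are after is then a matter of indexing into the returned list.
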
Regards,

	Shlomi Fish

>   --Yuval

-- 

---------------------------------------------------------------------
Shlomi Fish      shlomif at iglu.org.il
Homepage:        http://shlomif.il.eu.org/

Quidquid latine dictum sit, altum videtur.
        [Whatever is said in Latin sounds profound.]


