[Israel.pm] HTML Tables Parsing with Perl

Omer Zak omerz at actcom.co.il
Thu May 27 02:17:25 PDT 2004

I suggest converting your HTML files into well-formed XML and then 
using XML packages to process them.

The following keyword should help you:
tidy (which cleans up HTML files and converts them into well-formed XML).

In Python (my specialty), there are packages which deal with DOM and 
XPath (such as 4Suite), but I don't know what is available for Perl. 
However, I am confident that Perl is as rich as Python (if not richer) 
in this department.
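
To illustrate the idea, here is a minimal Python sketch that uses only
the standard library's xml.etree.ElementTree. It assumes the HTML has
already been converted to well-formed XML (for example by tidy), and it
picks the cell by an id attribute that I made up for the example; the
sample markup, the id and the helper name inner_xml are all
hypothetical. Real tidy output normally declares the XHTML namespace,
so the path would have to be namespace-qualified there.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, assumed to be well-formed XML already
# (for instance after running the original HTML through tidy).
XHTML = """<table>
  <tr>
    <td>before</td>
    <td id="target">text b/w
      <table><tr><td>text</td></tr></table>
    </td>
    <td>after</td>
  </tr>
</table>"""


def inner_xml(elem):
    """Return everything between elem's start and end tags,
    including nested markup, but excluding elem's own tags."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serialises the child, its subtree and its tail text
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


root = ET.fromstring(XHTML)
# Select the wanted cell; the id attribute is an assumption of this sketch
target = root.find(".//td[@id='target']")
content = inner_xml(target)
print(content)
```

Because the parser builds a tree, the nested table is simply a child of
the selected cell, so neighbouring cells and inner tables cause no
ambiguity the way they do with a regex. The same approach should carry
over to whichever DOM or XPath package Perl offers.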

Yuval Yaari wrote:
> Hi,
> I know this sounds simple, and of course I didn't try to re-invent the 
> wheel BUT I used HTML::TableContentParser...
> Which doesn't really work well for me :)
> Basically, I need to extract all the data from a <TD> ...
> But if there's a table inside that <TD>, HTML::TableContentParser fails.
> Basically, I need:
> <TD>                 <---- From here
>    text
>    b/w
>    <TABLE>
>        <TR>
>          <TD>text</TD>
>       </TR>
>    </TABLE>
> </TD>                 <---- All the way to here, excluding the </TD>...
> So as you see, I can't be 100% sure that there won't be any <TABLE>s 
> inside that <TD> (though I do want them...).
> There may also be <TD>'s before/after the specific <TD> I'm looking for, 
> so I wasn't able to write a regex.
> Any modules, scripts, regexes (???) would be highly appreciated.

