[Israel.pm] HTML Tables Parsing with Perl

Mikhael Goikhman migo at homemail.com
Thu May 27 02:52:34 PDT 2004


On 27 May 2004 12:04:55 +0300, Yuval Yaari wrote:
> 
> Basically, I need to extract all the data from a <TD> ...
> But if there's a table inside that <TD>, HTML::TableContentParser fails.
> 
> Basically, I need:
> <TD>                 <---- From here
>    text
>    b/w
>    <TABLE>
>        <TR>
>          <TD>text</TD>
>       </TR>
>    </TABLE>
> </TD>                 <---- All the way to here, excluding the </TD>...
> 
> Any modules, scripts, regexes (???) would be highly appreciated.

If I don't want to depend on other modules (or, as you said, these
modules do not work), and I know there are no more than 2 levels of
nested <td>, then I just write the following in my parsers:

  while ($text =~ m!(<td>((<td>.*?</td>|.+?)*?)</td>)!sig) {
    print "Process $1 as needed\n";
  }
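
For illustration, here is the two-level loop above run against input shaped like the example in the question. This is only a sketch: the sample string and the @cells array are made up for the demo, and it assumes bare <td> tags without attributes, as the pattern itself does.

```perl
use strict;
use warnings;

# Sample input modelled on the question: one outer cell holding a nested table.
my $text = <<'HTML';
<TABLE><TR>
<TD>
   text
   b/w
   <TABLE>
       <TR>
         <TD>text</TD>
      </TR>
   </TABLE>
</TD>
</TR></TABLE>
HTML

# Two-level pattern from the message: $1 is the whole cell, $2 its content.
my @cells;
while ($text =~ m!(<td>((<td>.*?</td>|.+?)*?)</td>)!sig) {
    push @cells, $2;    # content between the outer <TD> and its </TD>
}

print scalar(@cells), " outer cell(s) found\n";
print "[$cells[0]]\n";
```

The inner cell is not reported separately, because with /g the next search resumes after the outer </TD>; it comes back as part of the outer cell's content.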

And if I know there are no more than 3 levels, the regexp is:

  while ($text =~ m!(<td>((<td>(<td>.*?</td>|.+?)*?</td>|.+?)*?)</td>)!sig) {

I usually support at least 6-7 levels in my programs. It is really easy
once you figure out how such regexps are built: to support one more
level, just replace the innermost ".*?" with the parenthesized string
"(<td>.*?</td>|.+?)*?". :)

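Such a pattern can also be generated instead of typed by hand. The sketch below is one possible way to do it, not something from my parsers as posted: td_pattern is a hypothetical helper name, it nests by wrapping the previous level's content pattern inside one more <td>...</td>, and it uses (?:...) for the inner groups. Like the patterns above, it assumes bare <td> tags without attributes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original message): build a pattern that
# matches a <td> cell containing up to $levels levels of nested <td> cells.
sub td_pattern {
    my ($levels) = @_;
    my $content = '.*?';    # innermost cell: no further nesting expected
    for (2 .. $levels) {
        # Allow either a complete cell of the previous depth or one more character.
        $content = "(?:<td>$content</td>|.+?)*?";
    }
    return qr!<td>($content)</td>!si;    # $1 is the content of the outer cell
}

# Three levels of nesting in the input.
my $text = '<td>a<td>b<td>c</td>d</td>e</td>';

my ($two)   = $text =~ td_pattern(2);
my ($three) = $text =~ td_pattern(3);

print "2 levels: $two\n";
print "3 levels: $three\n";
```

With only 2 levels the match stops at the second </td> and cuts the outer cell short; with 3 levels the whole outer cell content is captured.
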
You may also use the (?:...) syntax to tell perl not to store all the
substrings; I skipped these "?:" above for better readability.
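
A small sketch of the difference, using a made-up test string and array names: the same two-level pattern is matched once with a capturing inner group and once with (?:...), and the number of stored substrings is compared.

```perl
use strict;
use warnings;

my $text = '<td>a<td>b</td>c</td>';

# Capturing version: the inner alternation is stored as group 3.
my @with    = $text =~ m!(<td>((<td>.*?</td>|.+?)*?)</td>)!si;

# Same pattern with the inner group made non-capturing.
my @without = $text =~ m!(<td>((?:<td>.*?</td>|.+?)*?)</td>)!si;

print scalar(@with), " vs ", scalar(@without), " stored substrings\n";
print "content: $without[1]\n";
```

In both cases the first two groups are the whole cell and its content; only the extra inner group goes away.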

Regards,
Mikhael.

-- 
perl -e 'print+chr(64+hex)for+split//,d9b815c07f9b8d1e'


