[Israel.pm] HTML Tables Parsing with Perl
Mikhael Goikhman
migo at homemail.com
Thu May 27 02:52:34 PDT 2004
On 27 May 2004 12:04:55 +0300, Yuval Yaari wrote:
>
> Basically, I need to extract all the data from a <TD> ...
> But if there's a table inside that <TD>, HTML::TableContentParser fails.
>
> Basically, I need:
> <TD> <---- From here
> text
> b/w
> <TABLE>
> <TR>
> <TD>text</TD>
> </TR>
> </TABLE>
> </TD> <---- All the way to here, excluding the </TD>...
>
> Any modules, scripts, regexes (???) would be highly appreciated.
If I don't want to depend on other modules (or, as you said, these
modules do not work), and I know that <td> elements are nested no more
than 2 levels deep, I just write the following in my parsers:
while ($text =~ m!(<td>((<td>.*?</td>|.+?)*?)</td>)!sig) {
    print "Process $1 as needed\n";
}
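For illustration, here is a minimal sketch of that loop applied to the cell from the question. The sub name outer_cells and the sample string are made up for this example, and the pattern only matches bare <td> tags without attributes (the /i flag handles upper case, /s lets "." cross newlines).

```perl
use strict;
use warnings;

# Hypothetical helper: return the content of every outermost <td> cell,
# allowing one level of nested <td> inside it.
sub outer_cells {
    my ($text) = @_;
    my @cells;
    while ($text =~ m!(<td>((<td>.*?</td>|.+?)*?)</td>)!sig) {
        push @cells, $2;    # $1 is the whole cell, $2 only its content
    }
    return @cells;
}

my $html = "<TD>text b/w <TABLE><TR><TD>text</TD></TR></TABLE></TD>";
print "[$_]\n" for outer_cells($html);
```

Since $2 excludes the enclosing tags, it holds everything from after the opening <TD> up to, but not including, the matching </TD>.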
And if I know there are no more than 3 levels, the regexp is:
while ($text =~ m!(<td>((<td>((<td>.*?</td>|.+?)*?)</td>|.+?)*?)</td>)!sig) {
I usually support at least 6-7 levels in my programs. It is really easy
once you figure out how such regexps are built: to support one more
level, just replace the innermost ".*?" with the parenthesized string
"((<td>.*?</td>|.+?)*?)". :)
You may also use the (?:...) syntax to tell perl not to capture all the
inner substrings; I left out the "?:" above for better readability.
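The level-by-level construction can also be generated instead of typed by hand. The following is only a sketch under a few assumptions: the names td_pattern_source and td_pattern are invented here, the inner groups use (?:...) so that only $1 (the whole cell) is captured, and the tags are bare <td> without attributes. Each extra level wraps the previous content pattern in "(?:<td>...</td>|.+?)*?".

```perl
use strict;
use warnings;

# Hypothetical helper: pattern source for a <td> cell nested up to $levels
# deep (1 = no nested cells). Inner groups are non-capturing.
sub td_pattern_source {
    my ($levels) = @_;
    my $content = '.*?';    # innermost cell: any text, matched lazily
    for (2 .. $levels) {
        # One more level: the content may hold complete cells of the
        # previous level, or any other characters.
        $content = "(?:<td>${content}</td>|.+?)*?";
    }
    return "(<td>${content}</td>)";
}

# Compile the source with the same /s and /i flags as the loops above.
sub td_pattern {
    my ($levels) = @_;
    my $source = td_pattern_source($levels);
    return qr{$source}si;
}

my $html = "<td>a<td>b<td>c</td>d</td>e</td>";
for my $levels (2, 3) {
    my ($cell) = $html =~ td_pattern($levels);
    print "$levels levels: $cell\n";
}
```

On this three-level sample the 2-level pattern stops at the second </td> and returns a truncated cell, while the 3-level pattern returns the whole string, so the depth has to be at least the real nesting depth of the input.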
Regards,
Mikhael.
--
perl -e 'print+chr(64+hex)for+split//,d9b815c07f9b8d1e'