[Israel.pm] HTML wrapper induction

Issac Goldstand margol at beamartyr.net
Wed Jun 9 07:12:48 PDT 2004

Why not use HTML::Parser?  Or if you want a shortcut for (3), try
HTML::SimpleLinkExtor.  Or am I missing the point?
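
For what it's worth, step (3) with HTML::Parser could look roughly like the
sketch below. It assumes the criterion is just a pattern on the href; the
extract_links name, the qr{^/news/} pattern and the sample page are all made
up for illustration and are not taken from your setup.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the href of every <a> tag in $html whose URL matches $criterion.
# $criterion is a hypothetical stand-in for the hand-written rule of step 3.
sub extract_links {
    my ($html, $criterion) = @_;
    my @links;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called once per start tag; tag and attribute names arrive lowercased.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                return unless $tagname eq 'a' && defined $attr->{href};
                push @links, $attr->{href} if $attr->{href} =~ $criterion;
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;    # signal end of document so buffered input is processed

    return @links;
}

# Made-up sample page, for illustration only.
my $html = <<'HTML';
<html><body>
<a href="/news/2004/06/09/item1.html">Item 1</a>
<a href="/about.html">About</a>
<A HREF="/news/2004/06/09/item2.html">Item 2</A>
</body></html>
HTML

my @links = extract_links($html, qr{^/news/});
print "$_\n" for @links;
```

This only covers the extraction itself. The criterion is still written by
hand, so it does nothing for the "learn it automatically" part of the
question.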


----- Original Message ----- 
From: "Shlomo Yona" <shlomo at cs.haifa.ac.il>
To: <perl at perl.org.il>
Sent: Wednesday, June 09, 2004 4:51 PM
Subject: [Israel.pm] HTML wrapper induction

> Hello,
> I have a daily process running on a server.
> Every time this process is invoked it performs:
> 1. download some HTML page from a given URL
> 2. parse the fetched HTML page
> 3. extract *some* of the URLs from the page according to some criteria
>    foreach link in (extracted list of URLs)
>      4. fetch the HTML page
>      5. parse the HTML page
>      6. extract *relevant* text from the page according to some criteria
>      7. store data (url, fetched HTML, extracted data)
> For now I have defined the criteria used in #3 and the criteria
> used in #6 manually, and then written appropriate Perl code
> to generate a program which is able to perform the
> extraction.
> This approach has two serious disadvantages:
> .A. The extraction criteria are good as long as page layout
> doesn't change. Because I have no control over the source of
> the HTML pages, I end up watching carefully every day that
> the extraction succeeded, and when it fails, I need to
> see if it is due to changes at the source and then re-design
> the extraction criteria and re-write the extraction code.
> .B. Adding more sources to this loop requires writing
> special criteria for the new sources and the problems
> mentioned in .A. are now multiplied by the number of sources.
> I would like to be able to "learn" the criteria (#3 and #6)
> automatically and hopefully even automatically produce the
> extraction code accordingly.
> Do any of you know any systems doing such a thing?
> -- 
> Shlomo Yona
> shlomo at cs.haifa.ac.il
> http://cs.haifa.ac.il/~shlomo/
