[Israel.pm] HTML wrapper induction
Shlomo Yona
shlomo at cs.haifa.ac.il
Wed Jun 9 06:51:41 PDT 2004
Hello,
I have a daily process running on a server.
Every time this process is invoked it performs:
1. download some HTML page from a given URL
2. parse the fetched HTML page
3. extract *some* of the URLs from the page
   according to some criteria
   foreach link in (extracted list of URLs):
     4. fetch the HTML page
     5. parse the HTML page
     6. extract *relevant* text from the page
        according to some criteria
     7. store data (url, fetched HTML, extracted data)
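To make the loop concrete, here is a minimal sketch of
steps 1 to 7. It is written in Python with only the standard
library and is not my actual code. All names in it
(LinkAndTextParser, fetch_page, run, link_criterion,
text_criterion) are made up for illustration, and the two
criteria are passed in as plain functions, standing in for
the hand-written rules of #3 and #6. It also assumes the
relevant text lives in <p> elements, which will not hold for
every source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects href values of <a> tags and the text inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def fetch_page(url):
    """Download a page and return it as text (steps 1 and 4)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(html):
    """Parse a page and return the populated parser (steps 2 and 5)."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    return parser


def run(index_url, link_criterion, text_criterion, fetcher=fetch_page):
    """Run the whole loop and return (url, html, extracted data) records."""
    index = parse(fetcher(index_url))                      # steps 1-2
    links = [urljoin(index_url, href) for href in index.links
             if link_criterion(href)]                      # step 3
    records = []
    for url in links:
        html = fetcher(url)                                # step 4
        page = parse(html)                                 # step 5
        data = [p.strip() for p in page.paragraphs
                if text_criterion(p)]                      # step 6
        records.append((url, html, data))                  # step 7
    return records
```

The fetcher argument exists only so the loop can be tried
against canned pages without touching the network; step 7
here just returns the records instead of writing them
anywhere.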
For now I defined the criteria used in #3 and the criteria
used in #6 manually, and then wrote appropriate Perl code
to generate a program that performs the extraction.
This approach has two serious disadvantages:
.A. The extraction criteria are good only as long as the page
layout doesn't change. Because I have no control over the
source of the HTML pages, I end up checking carefully every
day that the extraction succeeded, and when it fails, I need
to see whether it is due to changes at the source and then
re-design the extraction criteria and re-write the
extraction code.
.B. Adding more sources to this loop requires writing
special criteria for the new sources, and the problems
mentioned in #A are now multiplied by the number of sources.
I would like to be able to "learn" the criteria (#3 and #6)
automatically and hopefully even automatically produce the
extraction code accordingly.
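As a toy illustration of what "learning" a criterion could
mean, the Python sketch below takes a few pages in which the
wanted text has been marked by hand, finds the tag path that
leads to that text in every example, and then applies the
same path to an unseen page. This is only an assumption
about one possible approach, not a description of any
existing system; the names (PathParser, induce_path,
extract) and the "tag.class" path notation are invented for
this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not be pushed.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathParser(HTMLParser):
    """Records (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return the list of (tag path, text) pairs found in a page."""
    parser = PathParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def induce_path(examples):
    """Given (html, wanted_text) pairs, return a tag path that reaches
    the wanted text in every example, or None if there is none."""
    common = None
    for html, wanted in examples:
        paths = {path for path, text in text_nodes(html) if text == wanted}
        common = paths if common is None else common & paths
    # If several paths qualify, prefer the longest (most specific) one.
    return max(common, key=len) if common else None


def extract(html, path):
    """Apply a learned tag path to a page and return the matching text."""
    return [text for node_path, text in text_nodes(html) if node_path == path]
```

A path learned this way still breaks when the layout
changes, but in principle it could be re-learned from a few
freshly marked pages instead of re-writing extraction code
by hand.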
Do any of you know any systems doing such a thing?
--
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/