[Israel.pm] HTML wrapper induction

Shlomo Yona shlomo at cs.haifa.ac.il
Wed Jun 9 06:51:41 PDT 2004


Hello,

I have a daily process running on a server.
Every time this process is invoked it performs:

	1. download some HTML page from a given URL
	2. parse the fetched HTML page
	3. extract *some* of the URLs from the page
		according to some criteria
	then, for each link in the extracted list of URLs:
	4. fetch the HTML page
	5. parse the HTML page
	6. extract the *relevant* text from the page
		according to some criteria
	7. store the data (url, fetched HTML, extracted
		data)
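For reference, the steps above can be sketched in Perl roughly
like this. The link criterion (/article/) and the text criterion
(<p class="story"> elements) are made-up placeholders standing in
for the real #3 and #6 criteria, and a real HTML parser such as
HTML::LinkExtor or HTML::TokeParser would be more robust than the
regexes used here:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Step-3 criterion (hypothetical): keep links whose URL mentions /article/.
sub wanted_link { return $_[0] =~ m{/article/} }

# Step 3: pull href attributes out of <a> tags.  A real parser
# (HTML::LinkExtor, HTML::TokeParser) is more robust than this regex.
sub extract_links {
    my ($html) = @_;
    my @links = $html =~ /<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi;
    return grep { wanted_link($_) } @links;
}

# Step-6 criterion (hypothetical): text inside <p class="story"> elements.
sub extract_text {
    my ($html) = @_;
    my @chunks = $html =~ m{<p class="story">(.*?)</p>}gs;
    return join "\n", @chunks;
}

# Steps 1-7 wired together.
sub run {
    my ($start_url) = @_;
    require LWP::Simple;                               # CPAN; loaded lazily
    my $index = LWP::Simple::get($start_url)           # steps 1-2
        or die "cannot fetch $start_url";
    for my $url (extract_links($index)) {              # step 3
        my $html = LWP::Simple::get($url) or next;     # steps 4-5
        store($url, $html, extract_text($html));       # steps 6-7
    }
}

# Step 7: stand-in for real storage (a DB, files, ...).
sub store {
    my ($url, $html, $text) = @_;
    printf "%s: %d bytes HTML, %d chars extracted\n",
        $url, length $html, length $text;
}
```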

So far I have defined the criteria used in #3 and #6 manually,
and then written appropriate Perl code to generate a program
which performs the extraction.

This approach has two serious disadvantages:
.A. The extraction criteria are only good as long as the page
layout doesn't change. Because I have no control over the source
of the HTML pages, I end up checking carefully every day that
the extraction succeeded; when it fails, I need to determine
whether the source changed, and if so re-design the extraction
criteria and re-write the extraction code.
.B. Adding more sources to this loop requires writing special
criteria for each new source, and the problems mentioned in #A
are multiplied by the number of sources.

I would like to be able to "learn" the criteria (#3 and #6)
automatically, and hopefully even generate the extraction code
from them automatically.
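For what it's worth, the simplest form of HTML wrapper induction
(the "LR" wrappers in Kushmerick's WIEN work) learns a criterion
like #6 from a few example pages by taking the longest common
left and right delimiter strings around known target values. A
minimal pure-Perl sketch of that idea — the training pages and
targets below are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Longest common prefix / suffix helpers.
sub common_prefix {
    my $p = shift;
    for my $t (@_) {
        chop $p while length $p and substr($t, 0, length $p) ne $p;
    }
    return $p;
}

sub common_suffix {
    my $s = shift;
    for my $t (@_) {
        $s = substr $s, 1 while length $s
            and (length $s > length $t or substr($t, -length $s) ne $s);
    }
    return $s;
}

# Learn (left, right) delimiters from [html, target] example pairs.
sub learn_wrapper {
    my (@lefts, @rights);
    for my $ex (@_) {
        my ($html, $target) = @$ex;
        my $i = index $html, $target;
        die "target not found in example page" if $i < 0;
        push @lefts,  substr $html, 0, $i;
        push @rights, substr $html, $i + length $target;
    }
    # Left delimiter: what the pages share just *before* the target;
    # right delimiter: what they share just *after* it.
    return (common_suffix(@lefts), common_prefix(@rights));
}

# Apply the learned delimiters to a new page.
sub apply_wrapper {
    my ($html, $l, $r) = @_;
    my @out;
    my $pos = 0;
    while ((my $i = index $html, $l, $pos) >= 0) {
        my $start = $i + length $l;
        my $j = index $html, $r, $start;
        last if $j < 0;
        push @out, substr $html, $start, $j - $start;
        $pos = $j + length $r;
    }
    return @out;
}

# Hypothetical training pages with known targets:
my ($l, $r) = learn_wrapper(
    [ '<b>Title:</b> <i>Foo</i><br>',     'Foo'     ],
    [ '<b>Title:</b> <i>Bar baz</i><br>', 'Bar baz' ],
);
print join("\n", apply_wrapper('<b>Title:</b> <i>Quux</i><br>', $l, $r)), "\n";
# prints "Quux"
```

Real pages of course need more than a single (left, right) pair —
multi-slot wrappers, tolerance to noise, and some way to detect
when the learned wrapper has stopped matching — but the induction
step itself is this simple in spirit.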

Do any of you know any systems doing such a thing?



-- 
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/


