[Israel.pm] HTML wrapper induction

Shlomo Yona shlomo at cs.haifa.ac.il
Wed Jun 9 11:42:38 PDT 2004


On Wed, 9 Jun 2004, Gabor Szabo wrote:

> At what level of changes are you expecting to cover ?
> I mean if the site swaps its pages or is suddenly written in
> Yiddish, you are not expecting to get by with the automated tool, right?

I'm expecting style changes rather than actual content
changes. Some things shouldn't change, like the amount of
text in a news article vs. the amount of text in
commercial/navigational parts of the web page.

Nonetheless, you are asking a very good question. The
"features" to train by should be decided up front and the
classifier will be as good (or as bad) as the features
selected.
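
A minimal sketch of what such features might look like, assuming
the per-block text length and the share of that text sitting inside
links are useful signals for telling article text from navigation.
The class name, the set of block tags and the feature names are all
hypothetical choices for illustration, written in Python with only
the standard library; real pages with badly nested tags would need
more care.

```python
from html.parser import HTMLParser


class BlockFeatures(HTMLParser):
    """Collect text length and link-text length per block element."""

    # Assumption: these tags are a good enough notion of a "block".
    BLOCK_TAGS = {"div", "p", "td", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks, in closing order
        self._stack = []   # currently open blocks
        self._in_link = 0  # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._stack.append({"tag": tag, "text_len": 0, "link_len": 0})
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        # Simplification: assumes block tags are properly nested.
        if tag in self.BLOCK_TAGS and self._stack:
            self.blocks.append(self._stack.pop())
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n and self._stack:
            block = self._stack[-1]  # credit text to the innermost block
            block["text_len"] += n
            if self._in_link:
                block["link_len"] += n


def extract_features(html):
    """Return one feature dict per block, with a link_ratio added."""
    parser = BlockFeatures()
    parser.feed(html)
    parser.close()
    features = []
    for block in parser.blocks:
        ratio = block["link_len"] / block["text_len"] if block["text_len"] else 0.0
        features.append({**block, "link_ratio": ratio})
    return features
```

A navigation block would tend to show a link_ratio near 1, and an
article block a long text_len with a link_ratio near 0. A classifier
trained on such vectors is only as good as these choices.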

> Can you create a regex for the text of the link that will withstand the
> possible changes ?

That's not what I'm after.
I'm after some agent that will be able to process several
annotated examples (say, a few dozen pages where the
desired text is marked), and from then on produce the
correct code which can extract the desired texts.

This is, of course, instead of having someone manually
analyze these examples and then create the code.
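
To make the idea concrete, here is a minimal sketch of one very
simple form of wrapper induction: learning a left and a right
delimiter string from annotated pages. It assumes each example is a
(page, target) pair where the target text occurs verbatim in the
page, and that the markup immediately around the target is the same
on every page. The function names are hypothetical, and a practical
learner would have to cope with delimiters that also occur earlier
in the page.

```python
import os


def common_prefix(strings):
    """Longest string that every input starts with."""
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, target) pairs."""
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)  # raises ValueError if not annotated correctly
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    left = common_suffix(lefts)     # context shared just before the target
    right = common_prefix(rights)   # context shared just after the target
    if not left or not right:
        raise ValueError("no common delimiters found in the examples")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract the text between the learned delimiters, or None."""
    left, right = wrapper
    start = page.find(left)
    if start < 0:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end >= 0 else None
```

With more examples the learned delimiters get shorter and more
general, which is the sense in which the code is produced from the
examples rather than written by hand.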

> BTW how are you expecting something to learn if you can give only one
> example ?

I'm not talking about one example.


-- 
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/
