[Israel.pm] screen scraping
JPeyser
jpeyser at pobox.com
Thu Feb 5 14:05:06 PST 2004
The article sounds exciting but dissecting the html is such a pain. Some
sites are more straightforward than others. Here is a program that pulls
articles from www.perl.com based on year and month.
#!/usr/local/bin/perl
use WWW::Mechanize;
use HTML::TokeParser;
use strict;
### Get Options Section
use Getopt::Std;
use vars qw($opt_y $opt_m $opt_h);
getopts ('y:m:h');
my $year = $opt_y;
my $month = $opt_m;
my $help = $opt_h || 0;
my $usage = <<UsAgE;
Usage: $0 [-y year] [-m month]
-y 4 digit year, e.g. 2004. Default is current year.
-m the numeric month, e.g. 1. Default is current month.
-h help. Displays this message
UsAgE
die $usage if $help;
### Target URL
my $url = 'http://www.perl.com/';
my $agent = WWW::Mechanize->new();
$agent->get("$url");
my $stream = HTML::TokeParser->new(\$agent->{content});
### Default Date is Today
my ($tag, $day, $dd, $mm, $yy, @articles, %articles, $article, $content,
$token);
(undef, undef, undef, $dd, $mm, $yy) = localtime();
$year ||= $yy + 1900;
$month ||= $mm + 1;
my $pattern = sprintf("${url}pub/a/%d/%02d/", $year, $month );
### Get the articles from specified month/year
my $i = 1;
while($tag = $stream->get_tag("a")) {
next unless $tag->[1]{href} =~ qq!$pattern!;
$tag->[1]{href} =~ m!$pattern(.*)$!;
unless ($articles{$1}) {
$articles{$1} = $i++;
push @articles, $tag->[1]{href};
}
}
### Print a quick Table of Contents
my $i = 1;
print '=' x 60, "\n";
foreach $article (sort { $articles{$a} <=> $articles{$b} } keys %articles) {
print $i++, ". $article\n";
}
print '=' x 60, "\n";
### Get Articles
foreach $article (@articles) {
$agent->get($article);
$stream = HTML::TokeParser->new(\$agent->{content});
my $header = 0;
while($tag = $stream->get_token("h2", "h3", "p", "html")) {
$token = $tag->[1];
### Doesn't seem to be any h1 tags!?
if ($tag->[0] eq 'S' && $token eq 'h2') {
$content = $stream->get_trimmed_text("\/$token");
print "*** $content ***\n\n" if $header;
print "\n", '-' x 40, "\n$content\n", '-' x 40, "\n" unless $header;
$header = 1;
} elsif ($tag->[0] eq 'S' && $token eq 'h3') {
$header = 1;
$content = $stream->get_trimmed_text("\/$token");
print "--- $content ---\n\n"
} elsif ($tag->[0] eq 'S' && $token eq 'p' && $header) {
$content = $stream->get_trimmed_text("\/$token");
print "$content\n\n";
} elsif ($tag->[0] eq 'E' && $token eq 'html') {
### Article is finished at </html>!
last;
}
}
}
Jonathan
----- Original Message -----
From: "Yuval Yaari" <yuval at windax.com>
To: "Perl in Israel" <perl at perl.org.il>
Sent: Thursday, February 05, 2004 4:37 AM
Subject: Re: [Israel.pm] screen scraping
> Read this:
> http://www.perl.com/pub/a/2003/01/22/mechanize.html
>
> Hopefully it would help you.
> Keep us updated :)
>
> --Yuval
>
> Scott Weisman wrote:
>
> >hello all,
> >
> >i have a project where i need to write a screen scraper app. i see that
there
> >are two cpan modules that are both of recent vintage: WWW::Mechanize and
> >Scraper. Can anyone tell me more info on either module?
> >
> >thanks,
> >
> >scott
> >_______________________________________________
> >Perl mailing list
> >Perl at perl.org.il
> >http://www.perl.org.il/mailman/listinfo/perl
> >
> >YAPC::Israel::2004
> >http://www.perl.org.il/YAPC/2004/
> >
> >
> >
> >
>
> _______________________________________________
> Perl mailing list
> Perl at perl.org.il
> http://www.perl.org.il/mailman/listinfo/perl
>
> YAPC::Israel::2004
> http://www.perl.org.il/YAPC/2004/
More information about the Perl
mailing list