[Israel.pm] screen scraping

JPeyser jpeyser at pobox.com
Thu Feb 5 14:05:06 PST 2004


The article sounds exciting but dissecting the html is such a pain. Some
sites are more straightforward than others. Here is a program that pulls
articles from www.perl.com based on year and month.

#!/usr/local/bin/perl

use WWW::Mechanize;
use HTML::TokeParser;
use strict;

### Get Options Section
use Getopt::Std;
use vars qw($opt_y $opt_m $opt_h);

getopts ('y:m:h');

my $year = $opt_y;
my $month = $opt_m;
my $help = $opt_h || 0;

my $usage = <<UsAgE;
Usage: $0 [-y year] [-m month]
       -y 4 digit year, e.g. 2004. Default is current year.
       -m the numeric month, e.g. 1. Default is current month.
       -h help. Displays this message
UsAgE

die $usage if $help;

### Target URL
my $url = 'http://www.perl.com/';

my $agent = WWW::Mechanize->new();
$agent->get("$url");

my $stream = HTML::TokeParser->new(\$agent->{content});

### Default Date is Today
my ($tag, $day, $dd, $mm, $yy, @articles, %articles, $article, $content,
$token);

(undef, undef, undef, $dd, $mm, $yy) = localtime();
$year ||= $yy + 1900;
$month ||= $mm + 1;
my $pattern = sprintf("${url}pub/a/%d/%02d/", $year, $month );

### Get the articles from specified month/year
my $i = 1;
while($tag = $stream->get_tag("a")) {
 next unless $tag->[1]{href} =~ qq!$pattern!;
 $tag->[1]{href} =~ m!$pattern(.*)$!;
 unless ($articles{$1}) {
  $articles{$1} = $i++;
  push @articles, $tag->[1]{href};
 }
}

### Print a quick Table of Contents
my $i = 1;
print '=' x 60, "\n";
foreach $article (sort { $articles{$a} <=> $articles{$b} } keys %articles) {
 print $i++, ". $article\n";
}
print '=' x 60, "\n";

### Get Articles
foreach $article (@articles) {
 $agent->get($article);
 $stream = HTML::TokeParser->new(\$agent->{content});
 my $header = 0;
 while($tag = $stream->get_token("h2", "h3", "p", "html")) {
  $token = $tag->[1];
  ### Doesn't seem to be any h1 tags!?
  if ($tag->[0] eq 'S' && $token eq 'h2') {
            $content = $stream->get_trimmed_text("\/$token");
   print "*** $content ***\n\n" if $header;
   print "\n", '-' x 40, "\n$content\n", '-' x 40, "\n" unless $header;
   $header = 1;
  } elsif ($tag->[0] eq 'S' && $token eq 'h3') {
   $header = 1;
   $content = $stream->get_trimmed_text("\/$token");
   print "--- $content ---\n\n"
  } elsif ($tag->[0] eq 'S' && $token eq 'p' && $header) {
   $content = $stream->get_trimmed_text("\/$token");
   print "$content\n\n";
  } elsif ($tag->[0] eq 'E' && $token eq 'html') {
   ### Article is finished at </html>!
   last;
  }
 }
}

Jonathan

----- Original Message ----- 
From: "Yuval Yaari" <yuval at windax.com>
To: "Perl in Israel" <perl at perl.org.il>
Sent: Thursday, February 05, 2004 4:37 AM
Subject: Re: [Israel.pm] screen scraping


> Read this:
> http://www.perl.com/pub/a/2003/01/22/mechanize.html
>
> Hopefully it would help you.
> Keep us updated :)
>
>   --Yuval
>
> Scott Weisman wrote:
>
> >hello all,
> >
> >i have a project where i need to write a screen scraper app. i see that
there
> >are two cpan modules that are both of recent vintage: WWW::Mechanize and
> >Scraper. Can anyone tell me more info on either module?
> >
> >thanks,
> >
> >scott
> >_______________________________________________
> >Perl mailing list
> >Perl at perl.org.il
> >http://www.perl.org.il/mailman/listinfo/perl
> >
> >YAPC::Israel::2004
> >http://www.perl.org.il/YAPC/2004/
> >
> >
> >
> >
>
> _______________________________________________
> Perl mailing list
> Perl at perl.org.il
> http://www.perl.org.il/mailman/listinfo/perl
>
> YAPC::Israel::2004
> http://www.perl.org.il/YAPC/2004/




More information about the Perl mailing list