[Israel.pm] How to get TEXT from PDF ?

Levenglick Dov-RM07994 dov at freescale.com
Sun Jun 28 23:40:07 PDT 2009


The text() method in PDF::OCR2 looks like it fits the bill.

 
Best Regards,
Dov Levenglick
SmartDSP OS Development Leader

-----Original Message-----
From: perl-bounces at perl.org.il [mailto:perl-bounces at perl.org.il] On
Behalf Of Roey Almog (Infoneto Ltd)
Sent: Sunday, June 28, 2009 12:37
To: Perl in Israel
Subject: [Israel.pm] How to get TEXT from PDF ?

Hi,

I tried using CAM::PDF to extract text from PDFs in the following way:

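use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Open the document; new() returns undef on failure.
my $pdf = CAM::PDF->new("demo.pdf")
    or die "Cannot open demo.pdf: $CAM::PDF::errstr\n";

# Parse the content stream of page 1 and render it as plain text.
my $pageone_tree = $pdf->getPageContentTree(1);
my $string = CAM::PDF::PageText->render($pageone_tree);
print $string;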

It works for certain types of PDFs, but most of the time I get output
like:

\x01\x02\x03\x04\x05\x06\x07\x08\x02	

\x01\x02\x03\x04\x05\x06\x07\x06\x08	
\x04\x06\x0B\x04\x0C\x07
\x0E\x07	\x0B\x0E\x04\x0F\x0B\x10\x11

\x06\x12\x13\x0E\x08\x14\x15\x07
\x0E\x07	\x0B\x0E\x11\x16\x0E\x11\x15\x12

I checked whether this is just a simple mapping (like \x01 => A, etc.),
but it is not consistent at all, and the line lengths do not match
either.
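
Output like this can at least be detected automatically, so that a script
knows when the extracted text is unusable and another approach is needed.
The following is only a minimal sketch: the helper names
control_char_ratio and looks_garbled, and the 0.3 threshold, are
assumptions made for illustration and are not part of CAM::PDF.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of characters in $text that are control
# characters other than tab, newline, and carriage return.
sub control_char_ratio {
    my ($text) = @_;
    return 0 unless defined $text && length $text;
    # Count matches by assigning the match list to an empty list.
    my $count = () = $text =~ /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g;
    return $count / length $text;
}

# Hypothetical helper: true if the text is dominated by control characters.
# The default threshold of 0.3 is an arbitrary assumption.
sub looks_garbled {
    my ($text, $threshold) = @_;
    $threshold = 0.3 unless defined $threshold;
    return control_char_ratio($text) > $threshold ? 1 : 0;
}

# Example: one line of the output shown above.
my $sample = "\x01\x02\x03\x04\x05\x06\x07\x08\x02\t";
print looks_garbled($sample) ? "garbled\n" : "looks like text\n";
```

A check like this only flags the problem; it does not recover the text.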

Does anyone know a better way to convert PDF to text using Perl, or how
to fix my use of CAM::PDF?

Roey
_______________________________________________
Perl mailing list
Perl at perl.org.il
http://mail.perl.org.il/mailman/listinfo/perl

