[Israel.pm] How to get TEXT from PDF ?

Levenglick Dov-RM07994 dov at freescale.com
Sun Jun 28 23:40:07 PDT 2009

The text() method in PDF::OCR2 looks like it fits the bill.

Best Regards,
Dov Levenglick
SmartDSP OS Development Leader

-----Original Message-----
From: perl-bounces at perl.org.il [mailto:perl-bounces at perl.org.il] On
Behalf Of Roey Almog (Infoneto Ltd)
Sent: Sunday, June 28, 2009 12:37
To: Perl in Israel
Subject: [Israel.pm] How to get TEXT from PDF ?


I tried using CAM::PDF to get text out of PDFs in the following way:

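use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Open the document; new() returns undef on failure.
my $pdf = CAM::PDF->new("demo.pdf")
    or die "Cannot open demo.pdf: $CAM::PDF::errstr\n";
# Parse the content stream of page 1 and render its text.
my $pageone_tree = $pdf->getPageContentTree(1);
my $string = CAM::PDF::PageText->render($pageone_tree);
print $string;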

It works for certain types of PDFs, but most of the time I get output
like this:

\x0E\x07	\x0B\x0E\x04\x0F\x0B\x10\x11

\x0E\x07	\x0B\x0E\x11\x16\x0E\x11\x15\x12

I checked whether this is just a simple mapping (like \x01 => A, etc.),
but it is not consistent at all, and the line lengths do not match
either.
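
A check like the one described above can be sketched as follows. This is
only a minimal sketch: the sample bytes are copied from the first garbled
line above (assuming the tab character in it is \x09), and the helper
names hex_dump and byte_counts are made up for illustration. It prints
the bytes in hex and counts how often each value occurs, which is one way
to see whether a fixed one-to-one substitution is plausible.

```perl
use strict;
use warnings;

# Sample bytes copied from the first garbled line above.
# Assumption: the tab in that line is a literal \x09, written here as \t.
my $garbled = "\x0E\x07\t\x0B\x0E\x04\x0F\x0B\x10\x11";

# Render each byte as two hex digits so control characters become visible.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Count how often each byte value occurs. With a fixed one-to-one
# substitution, these counts should track letter frequencies in the text.
sub byte_counts {
    my ($bytes) = @_;
    my %count;
    $count{ ord($_) }++ for split //, $bytes;
    return \%count;
}

print hex_dump($garbled), "\n";

my $counts = byte_counts($garbled);
printf "0x%02X occurs %d time(s)\n", $_, $counts->{$_}
    for sort { $a <=> $b } keys %$counts;
```

Comparing these counts, and the total number of bytes per line, with the
text visible in the PDF viewer shows quickly whether a one-to-one table
could explain the output.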

Does anyone know a better way to convert PDF to text using Perl, or how
to fix or correctly use CAM::PDF?
