[Israel.pm] How to get TEXT from PDF ?

Roey Almog (Infoneto Ltd) almog at infoneto.co.il
Sun Jun 28 02:36:51 PDT 2009


Hi,

I tried using CAM::PDF to get text out of PDFs in the following way:

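use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Load the document; new() returns undef on failure.
my $pdf = CAM::PDF->new("demo.pdf")
    or die "Cannot open demo.pdf: $CAM::PDF::errstr\n";
# Parse the content stream of page 1 and render its text operators.
my $pageone_tree = $pdf->getPageContentTree(1);
my $string = CAM::PDF::PageText->render($pageone_tree);
print $string;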

It works for certain types of PDFs, but most of the time I get output like:

\x01\x02\x03\x04\x05\x06\x07\x08\x02	

\x01\x02\x03\x04\x05\x06\x07\x06\x08	
\x04\x06\x0B\x04\x0C\x07
\x0E\x07	\x0B\x0E\x04\x0F\x0B\x10\x11

\x06\x12\x13\x0E\x08\x14\x15\x07
\x0E\x07	\x0B\x0E\x11\x16\x0E\x11\x15\x12

I checked whether this is just a simple mapping (like \x01 => A, etc.),
but it is not consistent at all, and the lengths of the lines do not
match the visible text either.
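
The mapping check described above can be sketched roughly as follows. This
is only an illustration, not part of CAM::PDF: build_mapping is a
hypothetical helper, and the sample strings are made up rather than taken
from a real PDF. It assumes you can type in, by hand, the text that a given
extracted line should represent.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): try to build a one-to-one
# byte => character table from an extracted string and the text it is
# expected to represent. Returns (\%map, \@problems); an empty @problems
# means the pair is consistent with a simple substitution.
sub build_mapping {
    my ($extracted, $expected) = @_;
    my (%map, @problems);

    # A simple substitution cannot change the number of characters.
    if (length($extracted) != length($expected)) {
        push @problems, sprintf 'length mismatch: %d bytes vs %d characters',
            length($extracted), length($expected);
        return (\%map, \@problems);
    }

    for my $i (0 .. length($extracted) - 1) {
        my $code = ord substr($extracted, $i, 1);
        my $char = substr($expected, $i, 1);
        # The same code must always stand for the same character.
        if (exists $map{$code} && $map{$code} ne $char) {
            push @problems, sprintf 'code \\x%02X maps to both "%s" and "%s"',
                $code, $map{$code}, $char;
            next;
        }
        $map{$code} = $char;
    }
    return (\%map, \@problems);
}

# Made-up sample data, for illustration only.
my ($map, $problems) = build_mapping("\x01\x02\x03\x02", "ABCB");
print @$problems ? "inconsistent\n" : "consistent\n";
```

With real data, a non-empty problem list for most lines is what I mean by
"not consistent".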

Does anyone know a better way to convert PDF to text using Perl, or how
to fix or correctly use CAM::PDF?

Roey

