[Israel.pm] pdf2txt ps2txt

Shlomo Yona shlomo at cs.haifa.ac.il
Wed Nov 3 22:36:41 PST 2004


Hello,

I have a few dozen PDF files containing Hebrew text
with niqqud and images. They are issues of
sha'ar lamatxil (see:
http://www.slamathil.co.il/defaultHeb.htm).

I need to extract the Hebrew text (including the niqqud)
from the PDF files so that I can process it further.

I tried pdf2ps followed by ps2ascii (utilities I found on
my Mandrake 9.1 system). pdf2ps produced a valid PostScript
file that looks like the original PDF file, but the second
step failed completely: ps2ascii produced a small file with
blanks and a few control characters.
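
To make the failure easy to reproduce, the two steps and a quick
check of the result could be scripted roughly as below. This is
only a sketch: the function names are made up for illustration,
the decoding of the ps2ascii output as UTF-8 is an assumption,
and the set of code points counted as niqqud (U+05B0 to U+05BC,
plus U+05C1, U+05C2 and U+05C7) is one reasonable choice, not a
definitive list.

```python
import subprocess
from pathlib import Path

# Code points treated as niqqud here (an assumption, see above):
# U+05B0..U+05BC plus shin dot, sin dot and qamats qatan.
NIQQUD = set(range(0x05B0, 0x05BD)) | {0x05C1, 0x05C2, 0x05C7}


def count_hebrew(text):
    """Return (letters, niqqud): counts of Hebrew letters and niqqud marks."""
    letters = sum(1 for ch in text if 0x05D0 <= ord(ch) <= 0x05EA)
    niqqud = sum(1 for ch in text if ord(ch) in NIQQUD)
    return letters, niqqud


def pdf_to_text_via_ps(pdf_path, work_dir):
    """Run pdf2ps, then ps2ascii, and return whatever text comes out."""
    ps_path = Path(work_dir) / (Path(pdf_path).stem + ".ps")
    # Step 1: PDF to PostScript (this step worked for me).
    subprocess.run(["pdf2ps", str(pdf_path), str(ps_path)], check=True)
    # Step 2: PostScript to text; with no output file given,
    # ps2ascii writes to standard output.
    result = subprocess.run(
        ["ps2ascii", str(ps_path)], check=True, capture_output=True
    )
    # The output encoding is a guess; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling count_hebrew() on the text returned by
pdf_to_text_via_ps() for one of the issues would show at a
glance whether any Hebrew letters or niqqud survived; a result
of (0, 0) matches the failure described above.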

Can you suggest a method for extracting the texts (with the
niqqud) from the PDF files?

Thanks.

-- 
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/


