[Israel.pm] pdf2txt ps2txt

Offer Kaye offer.kaye at gmail.com
Thu Nov 4 01:39:18 PST 2004


On Thu, 4 Nov 2004 08:36:41 +0200 (IST), Shlomo Yona
<shlomo at cs.haifa.ac.il> wrote:
> I need to extract the Hebrew text (including the niqqud)
> from the PDF files, in order to further manipulate them.
> 

No Perl solutions, I'm afraid, but:

1. Have you tried looking at the output of "strings"? Depending on
your locale and your terminal's capabilities, it might actually
generate something worth looking at :-)

2. There is this project:
http://pdftohtml.sourceforge.net/
It might not preserve the niqqud, but since it converts to XML (or
HTML), it might work, at least partially.

3. ps2html might work:
http://www.csd.uch.gr/~nikop/thesis.html

4. Scribus:
http://www.scribus.org.uk/
is an Open Source Desktop Publishing system for Linux. I included it
in the list of possible tools because the site says that "Other
features include PDF Import, EPS import/export, Unicode text including
right to left scripts such as Arabic and Hebrew."
So it might be useful to you.
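
Whichever tool you try, it is worth checking whether the niqqud
actually survived the conversion. Here is a rough sketch in Python (not
Perl, sorry) that counts Hebrew letters and Hebrew combining marks in
the extracted text. The function names are my own invention, and I am
assuming the converter's output can be read as Unicode text; it
normalizes to NFD first so that precomposed presentation forms are
split into letter plus mark.

```python
import unicodedata

# Hebrew block ranges from the Unicode standard.
HEBREW_LETTERS = range(0x05D0, 0x05EB)  # alef .. tav
HEBREW_MARKS = range(0x0591, 0x05C8)    # cantillation marks and points


def hebrew_stats(text):
    """Return (letters, marks): counts of Hebrew letters and of
    Hebrew combining marks (niqqud and cantillation) in text."""
    letters = marks = 0
    # NFD splits precomposed forms into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", text):
        cp = ord(ch)
        if cp in HEBREW_LETTERS:
            letters += 1
        # Category "Mn" skips punctuation such as maqaf (U+05BE).
        elif cp in HEBREW_MARKS and unicodedata.category(ch) == "Mn":
            marks += 1
    return letters, marks


def has_niqqud(text):
    """True if the text contains at least one Hebrew combining mark."""
    return hebrew_stats(text)[1] > 0
```

You would read the converter's output into a string (with whatever
encoding the tool writes) and call hebrew_stats() on it. Plenty of
letters but zero marks suggests the tool dropped the niqqud.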

Good luck :-)
-- 
Offer Kaye


