[Israel.pm] pdf2txt ps2txt
Offer Kaye
offer.kaye at gmail.com
Thu Nov 4 01:39:18 PST 2004
On Thu, 4 Nov 2004 08:36:41 +0200 (IST), Shlomo Yona
<shlomo at cs.haifa.ac.il> wrote:
> I need to extract the Hebrew text (including the niqqud)
> from the PDF files, in order to further manipulate them.
>
No Perl solutions, I'm afraid, but:
1. Have you tried looking at the output of "strings"? Depending on
your locale and terminal capabilities, it might actually generate
something worth looking at :-)
2. There is this project:
http://pdftohtml.sourceforge.net/
It might not preserve the niqqud, but since it converts to XML (or
HTML), it might work, at least partially.
3. ps2html might work:
http://www.csd.uch.gr/~nikop/thesis.html
4. Scribus:
http://www.scribus.org.uk/
is an Open Source Desktop Publishing system for Linux. I included it
in the list of possible tools because the site says that "Other
features include PDF Import, EPS import/export, Unicode text including
right to left scripts such as Arabic and Hebrew."
So it might be useful to you.
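
To illustrate option 1, here is a minimal Python sketch of a "strings"-like
filter that keeps only runs of Hebrew letters and niqqud. It assumes the
text sits in the file as uncompressed UTF-8, which many PDFs will not
satisfy (streams are often compressed or use font-specific encodings), so
treat it as a quick probe rather than a real extractor. The names
hebrew_strings and min_len are made up for this example.

```python
import re

# U+0591-U+05C7: cantillation marks and niqqud points.
# U+05D0-U+05EA: the letters alef through tav.
# A run starts with a Hebrew character and may contain spaces after that.
HEBREW_RUN = re.compile(
    "[\u0591-\u05C7\u05D0-\u05EA][\u0591-\u05C7\u05D0-\u05EA ]*"
)


def hebrew_strings(data: bytes, min_len: int = 2, encoding: str = "utf-8") -> list[str]:
    """Return runs of Hebrew text found in raw bytes.

    Assumption: the Hebrew text is stored in `encoding` without compression.
    Bytes that do not decode are silently dropped.
    """
    text = data.decode(encoding, errors="ignore")
    runs = (match.group().rstrip() for match in HEBREW_RUN.finditer(text))
    return [run for run in runs if len(run) >= min_len]


if __name__ == "__main__":
    import sys

    # Usage: python hebrew_strings.py file.pdf
    with open(sys.argv[1], "rb") as handle:
        for run in hebrew_strings(handle.read()):
            print(run)
```

If this prints nothing, the text is most likely compressed or encoded
through the font, and one of the converters above is the better bet.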
Good luck :-)
--
Offer Kaye