[Israel.pm] Re: endash in Hebrew

Gaal Yahas gaal at forum2.org
Fri Mar 5 00:21:34 PST 2004


On Fri, Mar 05, 2004 at 07:54:03AM +0200, Yehuda Berlinger wrote:
> Thank you for the pointers; I am still trying to figure it out. After 
> wading through a few hundred pages of PDF files, I still haven't 
> found out how to get a Hebrew endash, let alone a Hebrew Aleph.
> 
> Is there a map for Hebrew characters to unicode numbers that will 
> actually work when I use chr(0xnnnn) somewhere? Or can someone give 
> me a more specific pointer to figuring out a specific character?

Don't confuse code points with data.

Unicode defines lots of characters and gives each one a number, its
code point. When you see something like U+1234, that's what you're
looking at: character number 0x1234 (the digits are hexadecimal) in the
abstract Unicode list.

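Now, to represent this character as data you need an encoding. The most
popular one in the unix world is UTF-8; if you're on Windows you will
see UTF-16 more often. The point is that U+1234 gets written differently
in different encodings. In Perl 5.8, chr() does take the code point and
gives you a character, but that character only becomes the bytes you
expect once it is encoded on output (for instance through an output
layer such as binmode(STDOUT, ":utf8")). If you build byte strings by
hand instead, you need the byte values for that character in your
encoding, not the code point.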

And how to get that? :)

You could find a standard for that encoding and look it up. An
alternative is to find an instance of that character in any known
encoding, convert it to your desired encoding, and look at a hex dump.

For example, suppose you know how to make an Aleph in "ASCII" -- not
really ASCII, of course, because ASCII never had Hebrew -- but in the
8-bit encoding known as ISO-8859-8, which puts Aleph at 0xE0. (The same
goes for Windows-1255, aka CP1255, which is very similar.) Create a file
with the Aleph:

     % perl -e 'print "\xE0"' > aleph

and convert it, say, to UTF-8:

     % iconv -f ISO-8859-8 -t UTF-8 < aleph > aleph.utf8

Now view the hexdump:

     % od -x aleph.utf8
     0000000 90d7
     0000002

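The first column is just the byte offset into the file: ignore it.

As you can see, UTF-8 encodes Aleph as two bytes, 0xD7 followed by 0x90.
Note that od -x prints 16-bit words in the machine's native byte order,
so on a little-endian box (such as x86) the pair shows up swapped, as
90d7. Use od -t x1 to see the bytes in file order.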
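
Since od -x groups its output into 16-bit words in host byte order, a
byte-at-a-time dump is easier to read. The sketch below repeats the same
conversion as a pipeline with no temporary files. It assumes an iconv
that knows ISO-8859-8 and a POSIX od; the function name aleph_utf8_bytes
is only for illustration.

```shell
# Convert a single ISO-8859-8 Aleph (byte 0xE0, octal 340) to UTF-8
# and dump the result one byte at a time, in file order.
aleph_utf8_bytes() {
    printf '\340' | iconv -f ISO-8859-8 -t UTF-8 | od -An -t x1
}

aleph_utf8_bytes   # prints the two bytes: d7 90
```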

The iconv command is available on most unix systems. If you leave out
-f or -t, iconv uses the encoding of your current locale for that side,
which may or may not be UTF-8, so it is safer to give both. od(1)
likewise exists on most unixes. If you're on Windows or another
platform, you'll have to find other tools.

Hope this helps,
Gaal

-- 
Gaal Yahas <gaal at forum2.org>
http://gaal.livejournal.com/


