[Israel.pm] about utf8

Gaal Yahas gaal at forum2.org
Sun Jan 18 11:04:44 PST 2009


Yes, that's true. The encoding is also self-synchronizing: if there's
one bad character somewhere, you only have to move a little bit
forward in the stream to get good data again. Also, many C library
functions that were designed for ASCII continue to work with UTF-8
without change. It's a nifty encoding.
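
A minimal sketch of the self-synchronizing property, assuming Python 3.
The helper names is_continuation and resync are made up for this
illustration. Starting from an arbitrary offset, it skips continuation
bytes (those of the form 10xxxxxx) until it reaches a byte that can
begin a character:

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte.

    Hypothetical helper: it only finds a plausible character boundary,
    it does not validate the sequence that starts there.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "a€b".encode("utf-8")  # bytes 61 E2 82 AC 62
start = resync(data, 2)       # index 2 is 0x82, the middle of the euro sign
print(start, data[start:].decode("utf-8"))  # prints: 4 b
```

Since a character is at most four bytes long, such a scan over
well-formed data skips at most three bytes before it lands on a
boundary.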

(Heh, my reply earlier contained an error too; I meant to say that
(middle byte & 0xC0) != 0xC0, not that it == 0. For 0x82 the result
is 0x80.)
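
As a quick check of the rules quoted below against the euro sign, here
is a small sketch, again assuming Python 3 and with made-up function
names. The parentheses around each mask are written out on purpose: in
C and Perl, == binds more tightly than &, so an unparenthesized
"first & 0x80 == 0" would not test what it appears to.

```python
def sequence_length(first: int) -> int:
    """Length implied by a lead byte, or 0 if it cannot start a character.

    Sketch only: it applies the four bit masks from the thread and does
    not reject overlong forms or lead bytes that are never valid.
    """
    if (first & 0x80) == 0x00:
        return 1
    if (first & 0xE0) == 0xC0:
        return 2
    if (first & 0xF0) == 0xE0:
        return 3
    if (first & 0xF8) == 0xF0:
        return 4
    return 0  # continuation byte, or not a lead byte under these masks


def is_continuation(byte: int) -> bool:
    """True when the top two bits are 10, that is (byte & 0xC0) == 0x80."""
    return (byte & 0xC0) == 0x80


euro = "€".encode("utf-8")                 # U+20AC
print([hex(b) for b in euro])              # ['0xe2', '0x82', '0xac']
print(sequence_length(euro[0]))            # 3
print([hex(b & 0xC0) for b in euro[1:]])   # ['0x80', '0x80']
```

Both trailing bytes mask to 0x80, which is neither 0 nor 0xC0.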

On Sun, Jan 18, 2009 at 8:56 PM, Shmuel Fomberg <semuelf at 012.net.il> wrote:
>
> The question should be: what was I thinking?
> OK. Here is another try.
>
> The first byte dictates how many bytes there are:
> first & 0x80 == 0 => one byte
> first & 0xE0 == 0xC0 => two bytes
> first & 0xF0 == 0xE0 => three bytes
> first & 0xF8 == 0xF0 => four bytes
>
> And for every other byte in the character:
> byte & 0xC0 == 0x80
>
> For your example, the first byte starts with an 'E', meaning three
> bytes. The remaining bytes start with '8' and 'A', so they are OK.
>
> Shmuel.
>
> Gaal Yahas wrote:
>> Are you sure? €, U+20AC is represented in UTF-8 as 0xE2, 0x82, 0xAC.
>> The middle byte & 0xC0 == 0.
>>
>> On Sun, Jan 18, 2009 at 7:26 PM, Shmuel Fomberg <semuelf at 012.net.il> wrote:
>>> Hi.
>>>
>>> I've been reading a bit about utf8, and I learned that when reading a
>>> utf8 character, for each byte I need to check:
>>> (byte & 0xC0 ) == 0xC0
>>> means that there is another byte for this character. Otherwise, it's the
>>> last byte of the character.
>>>
>>> Shmuel.
>



-- 
Gaal Yahas <gaal at forum2.org>
http://gaal.livejournal.com/


