[Israel.pm] about utf8

Shmuel Fomberg semuelf at 012.net.il
Sun Jan 18 10:56:54 PST 2009


The question should be: what was I thinking?
OK. Here is another try.

The first byte determines how many bytes the character occupies:
(first & 0x80) == 0x00 => one byte
(first & 0xE0) == 0xC0 => two bytes
(first & 0xF0) == 0xE0 => three bytes
(first & 0xF8) == 0xF0 => four bytes

And every other byte in the character must satisfy:
(byte & 0xC0) == 0x80
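
As a rough sketch, the two rules above could be written in Python like this. The function names are made up for illustration, and the check covers only the bit patterns listed here; it does not reject overlong encodings or surrogates, which a real validator would also need to do.

```python
def utf8_sequence_length(first):
    """Return how many bytes a character starting with this lead byte
    occupies, or None if the byte cannot start a character."""
    if (first & 0x80) == 0x00:
        return 1
    if (first & 0xE0) == 0xC0:
        return 2
    if (first & 0xF0) == 0xE0:
        return 3
    if (first & 0xF8) == 0xF0:
        return 4
    # Either a continuation byte (10xxxxxx) or an invalid lead (11111xxx).
    return None


def is_continuation(byte):
    """True if the byte has the 10xxxxxx continuation pattern."""
    return (byte & 0xC0) == 0x80


def is_well_formed_char(data):
    """Check one character's bytes against the bit patterns only."""
    if not data:
        return False
    length = utf8_sequence_length(data[0])
    if length is None or len(data) != length:
        return False
    return all(is_continuation(b) for b in data[1:])
```

Calling is_well_formed_char(bytes([0xE2, 0x82, 0xAC])) returns True, while dropping the last byte makes it return False.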

For your example, the first byte starts with an 'E', meaning three
bytes. The remaining bytes start with '8' and 'A', so both pass the
continuation check - OK.
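
To make the example concrete, here is a small Python sketch that applies the masks to the three bytes of the euro sign and then reassembles the code point. The variable names are only illustrative, and the payload masks (0x0F for the lead byte, 0x3F for the others) are the standard ones for a three-byte sequence rather than something stated in this thread.

```python
# The three bytes of U+20AC (euro sign) in UTF-8.
euro = bytes([0xE2, 0x82, 0xAC])

# Lead byte: the top four bits are 1110, so three bytes in total.
lead_is_three_byte = (euro[0] & 0xF0) == 0xE0

# Remaining bytes: the top two bits must be 10.
rest_are_continuations = [(b & 0xC0) == 0x80 for b in euro[1:]]

# Payload bits: 4 from the lead byte, 6 from each continuation byte.
code_point = (
    ((euro[0] & 0x0F) << 12)
    | ((euro[1] & 0x3F) << 6)
    | (euro[2] & 0x3F)
)

print(lead_is_three_byte)       # True
print(rest_are_continuations)   # [True, True]
print(hex(code_point))          # 0x20ac
```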

Shmuel.

Gaal Yahas wrote:
> Are you sure? €, U+20AC is represented in UTF-8 as 0xE2, 0x82, 0xAC.
> (middle byte & 0xC0) == 0x80, not 0xC0.
> 
> On Sun, Jan 18, 2009 at 7:26 PM, Shmuel Fomberg <semuelf at 012.net.il> wrote:
>> Hi.
>>
>> I've been reading a bit about utf8, and I learned that when reading a
>> utf8 character, for each byte I need to check:
>> (byte & 0xC0 ) == 0xC0
>> means that there is another byte for this character. Otherwise, it's the
>>  last byte of the character.
>>
>> Shmuel.



