[Israel.pm] Telling encoding of a string

Yuval Kogman lists at woobling.org
Tue Jan 27 12:14:56 PST 2004


On Tue, Jan 27, 2004 at 09:27:51PM +0200, Shahar Evron wrote:
> Is there any efficient way of checking whether a Hebrew string is in 
> UTF-8 or some other encoding?
> 
> I need to convert a series of Hebrew strings to Unicode, but some of the 
> strings are UTF-8 and some are ISO-8859-8 or cp1255 - of course I don't 
> want to convert the strings already in UTF-8.

That's a larger problem than it seems. Sadly, most encodings are
designed to be space efficient, so they give a meaning to nearly every
octet value. An octet stream that is valid in one encoding may
therefore be just as valid in another, yet decoding it each way gives
two results that are not equal.
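
To make that concrete, here is a minimal sketch using the Encode
module (the sample octets and variable names are mine, picked only for
illustration). It decodes the same two octets strictly as UTF-8 and as
cp1255, and both decodes succeed:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Two octets: the UTF-8 encoding of U+05E9 (HEBREW LETTER SHIN).
my $octets = "\xD7\xA9";

# FB_CROAK makes decode() die on invalid input; LEAVE_SRC keeps
# $octets untouched so it can be decoded a second time.
my $as_utf8   = decode('UTF-8',  $octets, FB_CROAK | LEAVE_SRC);
my $as_cp1255 = decode('cp1255', $octets, FB_CROAK | LEAVE_SRC);

# Neither call died, so the octets are valid in both encodings,
# but the decoded strings differ.
printf "UTF-8:  %d character(s)\n", length $as_utf8;
printf "cp1255: %d character(s)\n", length $as_cp1255;
print  "same text? ", ($as_utf8 eq $as_cp1255 ? "yes" : "no"), "\n";
```

Since neither decode fails, validity alone cannot tell you which
encoding the sender meant.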

I suggest you do not search for an efficient solution, but a reliable
one, preferably interrogating whoever collected the strings...

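Otherwise, if you know the string has some Hebrew in it,

	{ use bytes;
		# 215 (0xD7) is the lead octet of every Hebrew letter in UTF-8
		foreach my $octet (unpack("C*", $foo)){
			if ($octet == 215) {
				warn "string is UTF-8 Hebrew";
				last;	# one warning per string is enough
			}
		}
	}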

or try Encode::Guess, if you're not overly concerned with speed.
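
A minimal sketch of the Encode::Guess route, assuming cp1255 is the
only legacy encoding you expect. The helper name
guess_hebrew_encoding and the sample octets are hypothetical, not
something from your data. Encode::Guess already suspects ascii and
utf8 by default, and it does not cope well with more than one
single-byte suspect, so cp1255 and ISO-8859-8 should not both be
listed:

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: returns the name of the guessed encoding, or
# undef when Encode::Guess finds no match or cannot choose between
# several matches.
sub guess_hebrew_encoding {
    my ($octets) = @_;

    # ascii and utf8 are suspects by default; cp1255 is added here.
    my $enc = guess_encoding($octets, 'cp1255');

    # On success $enc is an encoding object; on failure it is a
    # plain string describing the problem.
    return ref($enc) ? $enc->name : undef;
}

# "shalom" as cp1255 octets (ISO-8859-8 uses the same octets for letters).
my $legacy = "\xF9\xEC\xE5\xED";
my $name   = guess_hebrew_encoding($legacy);
print defined $name ? "guessed: $name\n" : "could not decide\n";
```

Note that a UTF-8 string may also happen to decode cleanly as cp1255,
in which case the guess comes back ambiguous and the helper returns
undef. Asking whoever collected the strings is still the safer route.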

-- 
 ()  Yuval Kogman <nothingmuch at woobling.org> 0xEBD27418  perl hacker &
 /\  kung foo master: /me does a karate-chop-flip: neeyah!!!!!!!!!!!!!!



