[Israel.pm] Unicode un-handling

Mikhael Goikhman migo at homemail.com
Tue Apr 8 15:12:12 PDT 2008


On 09 Apr 2008 00:29:20 +0300, Shmuel Fomberg wrote:
> 
> I'm writing my site in Hebrew/Unicode.
> Actually, as far as I can see, the processing of the pages is not done 
> in Unicode. The Template Toolkit is loading the files without the :utf8 
> modifier and processes them as if they were plain ASCII.
> The data that comes from the DB is probably not marked as Unicode either, 
> so all the fields are being entered into the template as byte sequences.
> 
> And it all works. somehow.
> 
> But then I tried to enter data that is marked as utf8:
> my $check_encoding_mark = decode("utf8", pack "H*", 'c3a4e284a2c2ae');
> as one of the fields in the template.
> 
> Suddenly, all the Hebrew turned into something that looks like:
> ×?×?×¥ ×?×?×? ×?×?×ª× ×ªק×?×ª
> should be "press here to disconnect", in hebrew.
> 
> My guess is that when adding a utf8-marked data, Perl tried to convert 
> the old data from (latin-1?) to utf8.
> Is that correct?
> 
> I think that I should mark everything as utf8. I use:
> CGI::Application
> CGI::Application::Plugin::AnyTemplate - Template Toolkit
> Class::DBI
> 
> Can anyone help convince these modules to grok utf8?

You may Google for "perl utf8" to find many explanations of the problem.

  http://ahinea.com/en/tech/perl-unicode-struggle.html

The problem is pretty simple. You concatenate two non-ASCII strings, the
first of which is utf8-flagged and the second is not. Perl then upgrades
the unflagged one as if it were Latin-1, which garbles the Hebrew. To
solve the problem, either "decode_utf8" the second (unflagged) one or
"encode_utf8" the first (flagged) one before concatenating them. Use
"encoding::warnings" (on CPAN, or bundled with perl-5.10.0) to find where
an implicit encoding conversion happens.
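
A minimal sketch of the mixing problem and the two fixes, assuming the
unflagged string holds raw UTF-8 bytes. The variable names and the
single-letter sample are made up for illustration; only Encode's
decode_utf8 and encode_utf8 are used.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hebrew letter shin as raw UTF-8 bytes, as it might arrive from a file
# or a DB that was read without any decoding (hypothetical sample data).
my $bytes = "\xd7\xa9";

# A decoded (utf8-flagged) character string, built like the test value
# in the question.
my $chars = decode_utf8(pack("H*", 'c3a4e284a2c2ae'));

# Mixing them: Perl upgrades $bytes as if it were Latin-1, so each byte
# becomes a character of its own and the Hebrew letter is mangled.
my $mixed = $bytes . $chars;

# Fix 1: decode the byte string, then work with characters throughout.
my $all_chars = decode_utf8($bytes) . $chars;

# Fix 2: encode the character string, then work with bytes throughout.
my $all_bytes = $bytes . encode_utf8($chars);

# The lengths show which interpretation each string ended up with.
printf "mixed: %d, all chars: %d, all bytes: %d\n",
    length $mixed, length $all_chars, length $all_bytes;
```

The mixed string has 5 characters (2 mangled + 3 real), the decoded one
has 4 characters, and the encoded one has 9 bytes.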

Which one to use depends on the situation. Usually decode_utf8 is
preferable to encode_utf8, since operations such as substr or
printf "%9s" then work on characters rather than on bytes. On the other
hand, you should then probably set STDOUT to handle utf8-flagged data, to
avoid a warning from print. Use one of:

  use encoding STDOUT => "utf8";
  use encoding ':locale';
  no warnings 'utf8';
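
A small sketch of why decoded strings are easier to work with. Note that
it sets the output layer with binmode, which is a substitution on my part
for the "use encoding" lines; the sample word and variable names are
assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Declare the output encoding once, so that printing character strings
# does not trigger a "Wide character in print" warning.
binmode(STDOUT, ':encoding(UTF-8)');

# "shalom" in Hebrew as raw UTF-8 bytes (hypothetical sample data).
my $bytes = "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d";
my $chars = decode_utf8($bytes);

# On bytes, substr cuts by byte; on characters, it cuts by letter.
my $first_byte   = substr($bytes, 0, 1);   # half of a letter
my $first_letter = substr($chars, 0, 1);   # the whole letter shin

# sprintf/printf pad by characters, so the decoded string aligns properly.
my $padded = sprintf "%9s", $chars;

print "letters: ", length($chars), ", bytes: ", length($bytes), "\n";
print "first letter: $first_letter\n";
print "padded width: ", length($padded), "\n";
```

The decoded string has 4 letters against 8 bytes, and the padded result
is 9 characters wide, which is what "%9s" is meant to give.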

Regards,
Mikhael.

-- 
perl -e 'print+chr(64+hex)for+split//,d9b815c07f9b8d1e'


