[Israel.pm] Perl unicode question

Meir Guttman meir at guttman.co.il
Mon Feb 13 02:55:19 PST 2012

Dear Yitzchak,

-----Original Message-----
From: perl-bounces at perl.org.il [mailto:perl-bounces at perl.org.il] On Behalf Of Issac Goldstand
Sent: Monday, February 13, 2012 12:30 PM
To: Perl in Israel
Subject: [Israel.pm] Perl unicode question

If there's one thing I can never seem to get straight, it's character encodings...

I'm trying to parse some data from the web which can come in different encodings, and write unit tests which come from static files.

One of the strings that I'm trying to test for is "Forex Trading Avec 100€". The string is originally encoded (supposedly) in ISO-8859-1, based on the header Content-Type: text/html; charset=ISO-8859-1 and the presence of the following META tag:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

(N.B. I'm a bit confused by that as IIRC, ISO-8859-1 doesn't contain the EUR character...)

When opening the source code in a text editor as either ISO-8859-1 or ISO-8859-15 (or even UTF-8), I can't see the character.  I *do* see the character when viewing it as CP1255, which kinda worries me, as I get the feeling I'm a lot farther from the source than I think when I see that...

My unit test for the above is as follows:

use utf8; # String literals contain UTF-8 in this file
binmode STDOUT ":utf8";
...
open($fh, "<:encoding(ISO-8859-1)", "t/html0004.html") || die "...: $!";
$parser->parse_file($fh); # Subclassed HTML::Parser
...
is($test->{top}, "Forex Trading Avec 100€", "Correct headline text");

However, this test does not pass on the EURO, giving me the following:
Wide character in print at /usr/local/share/perl/5.12.4/Test/Builder.pm line 1759.
#          got: 'Forex Trading Avec 100€'
#     expected: 'Forex Trading Avec 100€'

Both the warning and the mismatch bother me....  The warning, because I assumed that opening STDOUT as a utf8 stream would deal with it.  And the mismatch, because I can't figure why it's mismatching...

FWIW, when doing this on the web, I'd planned on converting to utf-8 by using HTTP::Response's $res->decoded_content to deal with the encoding for me, but that seems to be spewing characters that... don't look correct... too :/

Any ideas?


There are a number of things that must all be done together for Unicode to be handled correctly. And don't put too much weight on the "charset..." clause in the HTML.
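
As a rough illustration of "done together", here is a minimal sketch. It rests on assumptions that are not confirmed by the file itself: that the euro sign in t/html0004.html is stored as the single byte 0x80 (which would explain why it shows up under CP1255 but not under ISO-8859-1 or ISO-8859-15), and that the page is therefore really Windows-1252 despite its charset label. The $bytes string is a hypothetical stand-in for the file contents. The sketch also sets the output layer on Test::Builder's own handles, since Test::Builder prints through its own copies of STDOUT and STDERR rather than through STDOUT directly.

```perl
use strict;
use warnings;
use utf8;                          # string literals in this file are UTF-8
use Encode qw(decode);
use Test::More;

# Test::Builder writes through its own duplicated handles, so a layer
# set only on STDOUT does not reach the diagnostics it prints.
my $builder = Test::More->builder;
binmode($builder->$_, ':encoding(UTF-8)')
    for qw(output failure_output todo_output);
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical stand-in for the bytes read from the HTML file,
# ASSUMING the euro sign is stored there as the single byte 0x80.
my $bytes    = "Forex Trading Avec 100\x80";
my $expected = "Forex Trading Avec 100€";

# ISO-8859-1 maps byte 0x80 to the control character U+0080.
isnt(decode('ISO-8859-1', $bytes), $expected,
     'decoding as ISO-8859-1 does not yield the euro sign');

# Windows-1252 maps byte 0x80 to U+20AC EURO SIGN.
is(decode('cp1252', $bytes), $expected,
   'decoding as cp1252 yields the euro sign');

done_testing;
```

If the assumption about the byte holds, opening the file with "<:encoding(cp1252)" instead of "<:encoding(ISO-8859-1)" would be the corresponding change in the test above; a hex dump of the file is the way to check which byte is actually there.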

Since this list does not accept attachments, I'll send my upcoming presentation on "Unicode aspects in Perl" to your personal address. It is to be presented at the Israel Perl Workshop 2012 (http://act.perl.org.il/ilpw2012/).

Anybody else who is interested is welcome to ask, and I'll send it to her/him too.
