[Israel.pm] Perl unicode question

Issac Goldstand margol at beamartyr.net
Mon Feb 13 02:30:23 PST 2012

If there's one thing I can never seem to get straight, it's character
encodings.

I'm trying to parse some data from the web which can come in different
encodings, and write unit tests which come from static files.

One of the strings that I'm trying to test for is "Forex Trading Avec
100€".  The string is originally encoded (supposedly) in ISO-8859-1, based
on the header Content-Type: text/html; charset=ISO-8859-1 and presence
of the following META tag <meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">

(N.B. I'm a bit confused by that as IIRC, ISO-8859-1 doesn't contain the
EUR character...)
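
A quick way to check that hunch is to ask Encode which charsets can represent U+20AC at all. This is only a sketch: the helper name euro_byte and the list of charsets are illustrative assumptions, not part of the test suite described here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the byte used for the euro sign in $charset
# as a hex string, or undef if the charset cannot represent it.
sub euro_byte {
    my ($charset) = @_;
    my $copy  = "\x{20AC}";   # EURO SIGN; encode() with a CHECK may modify its input
    # FB_QUIET stops at the first unencodable character instead of substituting "?"
    my $bytes = encode($charset, $copy, Encode::FB_QUIET());
    return length($bytes) ? sprintf("0x%02X", ord $bytes) : undef;
}

for my $charset (qw(ISO-8859-1 ISO-8859-15 cp1252 cp1255)) {
    my $hex = euro_byte($charset);
    printf "%-12s %s\n", $charset, defined $hex ? $hex : "not representable";
}
```

ISO-8859-1 has no euro sign, ISO-8859-15 puts it at 0xA4, and the Windows code pages cp1252 and cp1255 both put it at 0x80.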

When opening the source code in a text editor as either ISO-8859-1 or
ISO-8859-15 (or even UTF-8), I can't see the character.  I *do* see the
character when viewing it as CP1255, which kinda worries me, as I get the
feeling I'm a lot farther from the source than I think when I see that...
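
The CP1255 observation would make sense if the byte in the file is 0x80, which is where the Windows code pages keep the euro sign. That is an assumption here, since the raw bytes of the file have not been checked. Under that assumption, a minimal sketch of what each decoding produces (the helper name decode_0x80 is hypothetical):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the single byte 0x80 under $charset and
# report the resulting code point.
# Assumption: 0x80 is the byte that stands for the euro sign in the file.
sub decode_0x80 {
    my ($charset) = @_;
    my $char = decode($charset, "\x80");
    return sprintf("U+%04X", ord $char);
}

for my $charset (qw(ISO-8859-1 cp1252 cp1255)) {
    printf "%-12s %s\n", $charset, decode_0x80($charset);
}
```

If the byte really is 0x80, decoding as ISO-8859-1 yields the control character U+0080 rather than U+20AC, so a comparison against a literal euro sign would fail. Reading the file with ":encoding(cp1252)" may be worth trying in that case.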

My unit test for the above is as follows:

use utf8; # String literals contain UTF-8 in this file
binmode STDOUT ":utf8";
open($fh, "<:encoding(ISO-8859-1)", "t/html0004.html") || die "...: $!";
$parser->parse_file($fh); # Subclassed HTML::Parser
is($test->{top}, "Forex Trading Avec 100€", "Correct headline text");

However, this test does not pass on the EURO, giving me the following:
Wide character in print at /usr/local/share/perl/5.12.4/Test/Builder.pm
line 1759.
#          got: 'Forex Trading Avec 100€'
#     expected: 'Forex Trading Avec 100€'

Both the warning and the mismatch bother me....  The warning, because I
assumed that opening STDOUT as a utf8 stream would deal with it.  And
the mismatch, because I can't figure why it's mismatching...
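
On the warning: the Test::More documentation notes that Test::Builder duplicates STDOUT and STDERR, so a later binmode on STDOUT is not seen by the test output. Whether that is the whole story here is not confirmed, but the workaround from that documentation looks roughly like this (the single is() call is just a stand-in for the real test):

```perl
use strict;
use warnings;
use utf8;                      # string literals in this file are UTF-8
use Test::More tests => 1;

# Apply the encoding layer to the handles Test::Builder actually writes to,
# rather than to STDOUT itself.
my $builder = Test::More->builder;
binmode $builder->output,         ":encoding(UTF-8)";
binmode $builder->failure_output, ":encoding(UTF-8)";
binmode $builder->todo_output,    ":encoding(UTF-8)";

# Stand-in assertion: a euro sign written as an escape vs. as a literal
is("100\x{20AC}", "100€", "euro sign compares equal");
```

This only addresses the "Wide character in print" warning; the got/expected mismatch is a separate issue with how the input is decoded.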

FWIW, when doing this on the web, I'd planned on converting to utf-8 by
using HTTP::Response's $res->decoded_content to deal with the encoding
for me, but that seems to be spewing characters that... don't look
correct... too :/

Any ideas?

