[Israel.pm] Perl unicode question

Issac Goldstand margol at beamartyr.net
Mon Feb 13 03:12:00 PST 2012


On 13/02/2012 12:54, Gaal Yahas wrote:
>
> On Mon, Feb 13, 2012 at 12:30 PM, Issac Goldstand
> <margol at beamartyr.net> wrote:
>
>     If there's one thing I can never seem to get straight, it's character
>     encodings...
>
>     I'm trying to parse some data from the web which can come in different
>     encodings, and write unit tests which come from static files.
>
>     One of the strings that I'm trying to test for is "Forex Trading Avec
>     100EUR".  The string is originally encoded (supposedly) in
>     ISO-8859-1 based
>     on the header Content-Type: text/html; charset=ISO-8859-1 and presence
>     of the following META tag <meta http-equiv="Content-Type"
>     content="text/html; charset=ISO-8859-1">
>
>
> When dealing with encoding problems, it's helpful to isolate the
> problem as much as you can. Every piece that reports on an encoding
> can get it wrong, and the fact that both the server and the document
> claim it's 8859-1 doesn't mean they aren't lying. So start by
> fetching the document in raw form with curl or wget, and open that
> with "od -t x1a".
That gave me a hex dump, and I don't see how it really helps
(unless I get lucky and find a BOM at the start)...
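
For what it's worth, here is a minimal sketch of what one could look for
in such a dump, assuming the page is otherwise plain ASCII: only the bytes
>= 0x80 matter.  A lone 0x80 is a C1 control character in ISO-8859-1 but
the Euro sign in CP1252, while UTF-8 would encode the Euro sign as the
three bytes e2 82 ac.  The helper name high_bytes and the sample string
are made up for illustration; the real input would be the raw file.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, hex] pairs for every byte >= 0x80
# found in a raw octet string.
sub high_bytes {
    my ($octets) = @_;
    my @found;
    while ($octets =~ /([\x80-\xFF])/g) {
        # pos() points just past the match, so step back one byte.
        push @found, [ pos($octets) - 1, sprintf("%02x", ord $1) ];
    }
    return @found;
}

# Assumed sample: the headline with the Euro sign stored as one byte 0x80.
my $raw = "Forex Trading Avec 100\x80";
printf "offset %d: 0x%s\n", @$_ for high_bytes($raw);
# prints "offset 22: 0x80"
```

For the real document the octets would have to be read without any
decoding layer (for example with open's "<:raw"), otherwise the offsets
and values no longer describe what the server actually sent.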
>  
>
>     (N.B. I'm a bit confused by that as IIRC, ISO-8859-1 doesn't
>     contain the
>     EUR character...)
>
>
> The standard predates the currency.
I know - I meant it seemed odd that the document could *be* ISO-8859-1
given that fact.
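
To make the oddity concrete, here is a small sketch using the core Encode
module (byte_to_codepoint is just a name I made up for this example).  It
shows that byte 0x80 only becomes the Euro sign (U+20AC) under the Windows
code pages, and that ISO-8859-15 puts the Euro sign at 0xA4 instead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a single byte under the named encoding and
# return the resulting code point formatted as U+XXXX.
sub byte_to_codepoint {
    my ($encoding, $byte) = @_;
    my $octets = chr $byte;
    return sprintf "U+%04X", ord(decode($encoding, $octets));
}

for my $encoding (qw(iso-8859-1 iso-8859-15 cp1252 cp1255)) {
    printf "%-11s 0x80 -> %s\n", $encoding, byte_to_codepoint($encoding, 0x80);
}
printf "%-11s 0xA4 -> %s\n", "iso-8859-15",
    byte_to_codepoint("iso-8859-15", 0xA4);
```

Both ISO-8859 variants should report U+0080 for byte 0x80, so a page that
really shows a Euro sign there cannot be either of them.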
>  
>
>     When opening the source code in a text editor as either ISO-8859-1 or
>     ISO-8859-15 (or even UTF-8), I can't see the character.  I *do*
>     see the
>     character when viewing it as CP1255 which kinda worries me, as I
>     get the
>     feeling I'm a lot farther from the source than I think when I see
>     that...
>
>
> Sounds like you actually have the problem in your hands: somebody
> misencoded the data.
>  
>
>     My unit test for the above is as follows:
>
>     use utf8; # String literals contain UTF-8 in this file
>     binmode STDOUT, ":utf8";
>     ...
>     open($fh, "<:encoding(ISO-8859-1)", "t/html0004.html")
>         || die "...: $!";
>     $parser->parse_file($fh); # Subclassed HTML::Parser
>     ...
>     is($test->{top}, "Forex Trading Avec 100EUR",
>         "Correct headline text");
>
>
> If you tweak your code to use cp1255 (which encodes Euro as 0x80),
> does it pass? I expect it should, confirming the problem. 
>
>
With CP1255 it failed some other tests, producing Hebrew characters where
accented ones were expected.  CP1252 seemed to work, but this bothers me
because I'm still doing human guesswork, and that would (and, indeed, does)
still cause problems in the production code, which has only LWP's output to
work with.  And LWP goes by the charset declared by the document, from what
I can see:

(Message.pm line 359 from HTTP::Message)
    if ($self->content_is_text || (my $is_xml = $self->content_is_xml)) {
        my $charset = lc(
            $opt{charset} ||
            $self->content_type_charset ||
            $opt{default_charset} ||
            $self->content_charset ||
            "ISO-8859-1"
        );

Do you know a better way to guess the real charset?  The browsers
do it somehow...
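
One piece of the browser behaviour that I am fairly sure of: the HTML5 /
WHATWG encoding rules have them treat a declared ISO-8859-1 (and latin1,
us-ascii) as windows-1252, which would explain why they show the Euro
sign.  Below is a rough sketch of that idea using only the core Encode
module.  The function name decode_lenient is hypothetical, and the "try
strict UTF-8 first" step is my own heuristic, not something browsers or
LWP are documented to do.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Labels that are assumed here to really mean windows-1252.
my %ALIAS = map { $_ => 'cp1252' } ('', 'iso-8859-1', 'latin1', 'us-ascii');

# Hypothetical helper: turn raw octets plus the declared charset into a
# Perl character string.  Dies if the declared charset is unknown to Encode.
sub decode_lenient {
    my ($octets, $declared) = @_;
    $declared = lc($declared // '');

    # Heuristic: non-ASCII octets that are valid strict UTF-8 are rarely
    # anything else, so accept them regardless of the declared label.
    if ($octets =~ /[\x80-\xFF]/) {
        my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
        return $text if defined $text;
    }

    # Otherwise map latin-1 style labels to cp1252 and decode.
    my $encoding = $ALIAS{$declared} // $declared;
    return decode($encoding, $octets);
}

my $text = decode_lenient("Forex Trading Avec 100\x80", "ISO-8859-1");
printf "last char: U+%04X\n", ord substr($text, -1);
```

In the production code, the HTTP::Message excerpt above suggests that
$opt{charset} wins over everything else, so something like
$response->decoded_content(charset => 'cp1252') should override the
declared charset.  It is still a guess, just a more systematic one.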