[Israel.pm] weird input/output problem

Shlomo Yona shlomo at cs.haifa.ac.il
Tue Jan 6 00:46:56 PST 2004


I have a Perl program which is used to extract text from
HTML. It uses HTML::TokeParser.

The HTML in the input is encoded in cp1255. The output
should be the same, as I'm doing nothing to change it.
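To make "doing nothing to change it" concrete, here is a
minimal sketch (not the actual program) of how one cp1255
byte behaves when written through different PerlIO layers.
The helper name roundtrip_hex and the scratch file are
hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $bytes to $path through the given PerlIO layer, then read the
# file back as raw bytes and return them as a lowercase hex string.
sub roundtrip_hex {
    my ($bytes, $layer, $path) = @_;
    open(my $out, ">$layer", $path) or die "cannot write $path: $!";
    print {$out} $bytes;
    close($out) or die "cannot close $path: $!";

    open(my $in, '<:raw', $path) or die "cannot read $path: $!";
    local $/;                      # slurp the whole file
    my $data = <$in>;
    close($in);
    return unpack('H*', $data);
}

my (undef, $path) = tempfile();    # hypothetical scratch file
my $alef = "\xE0";                 # Hebrew alef in cp1255, one byte

# With a raw layer the byte passes through untouched.
print "raw:  ", roundtrip_hex($alef, ':raw',  $path), "\n";   # e0
# With a utf8 layer the same byte is re-encoded as two bytes.
print "utf8: ", roundtrip_hex($alef, ':utf8', $path), "\n";   # c3a0
unlink $path;
```

If some layer like the second one were being applied
implicitly, the output would no longer match the input byte
for byte, even though the script itself changes nothing.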

On server A, which runs GNU/Linux (2.4.19-36mdksmp), the
output is just as expected.
On server B, which also runs GNU/Linux (2.4.20-27.8smp), the
output is corrupted.

I think it has something to do with the default encoding on
the system (for A it is probably ISO-8859-1, while for B it
is probably UTF-8). How can I get to the bottom of this and
understand what causes the corrupted output on server B? I
am fairly sure the problem is specific to B, as the same
script runs correctly on many other machines with an
ordinary GNU/Linux installation. I would like to send an
informative error/complaint message to its sysadmins (who
normally do nothing to help understand how such things
occur and usually don't fix them unless you serve them the
exact solution on a silver platter).
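One way to start comparing the two servers is a small
report of the settings that could plausibly affect how Perl
treats bytes on I/O. This is only a sketch under the
assumption that the locale or the Perl version differs
between A and B; the function name encoding_report is made
up for this example, and PerlIO::get_layers may not be
available on very old Perl installations.

```perl
use strict;
use warnings;

# Collect the Perl version, the locale-related environment variables
# and the PerlIO layers currently active on STDIN and STDOUT.
sub encoding_report {
    my @lines;
    push @lines, sprintf("perl version: %vd", $^V);
    for my $var (qw(LANG LC_ALL LC_CTYPE PERL_UNICODE)) {
        my $value = defined $ENV{$var} ? $ENV{$var} : '(unset)';
        push @lines, "$var=$value";
    }
    # Assumption: this Perl provides PerlIO::get_layers.
    push @lines, "STDIN layers: "  . join(' ', PerlIO::get_layers(*STDIN));
    push @lines, "STDOUT layers: " . join(' ', PerlIO::get_layers(*STDOUT));
    return @lines;
}

print "$_\n" for encoding_report();
```

Running the same report on A and on B and diffing the two
outputs should show whether the locale, the Perl version or
the default layers differ between the machines.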

I suspect the problem lies with server B's configuration
because, a while ago, people complained that UTF-8 encoded
web pages on the server's website were not displayed
correctly (although the HTML properly declared that
encoding). After the sysadmins "solved" it, the web pages
looked fine and users no longer had to change their
browser's encoding, but I believe (though I cannot be sure)
that this is when the I/O behaviour of my Perl scripts
began to change.


Shlomo Yona
shlomo at cs.haifa.ac.il

