[Israel.pm] Encoding Question

Meir Guttman meir at guttman.co.il
Fri Oct 12 04:20:18 PDT 2012


Hi Gaal,

I am sorry to say, but I am not familiar with MongoDB, only with MySQL.

In MySQL you have to specify which encoding you store text in, which encoding your current input uses, and so on, although you can set a default encoding, usually UTF-8. In particular, when you use DBI and create a “connection” to the DB, you must also enable UTF-8 in the “connect attributes”, like this:

my %conn_attrs = (
    RaiseError        => 1,
    PrintError        => 0,
    AutoCommit        => 1,
    mysql_enable_utf8 => 1,
);

Discovering this was long and frustrating!

Maybe there is a similar attribute in MongoDB?

I am afraid that this is the only help I can provide...

Meir

 

From: perl-bounces at perl.org.il [mailto:perl-bounces at perl.org.il] On Behalf Of ynon perek
Sent: Friday, 12 October 2012 12:56
To: Perl in Israel
Subject: Re: [Israel.pm] Encoding Question

 

Hi,

 

(here's the long story)

 

Printing the string yields the correct result; the problem comes afterwards.

 

I used this code inside a Dancer route handler. When I just printed the string to a file or to the screen, everything worked great.

 

But, when I returned it to the browser, I got the wrong encoding.

Moreover, if I wrote it into a file and then used the 'send_file' method to send the file, everything was OK (correct encoding).

 

So that got me thinking it's a Dancer issue, which led me to Sawyer. He explained that Dancer tries to detect the encoding of strings, and if a string is not UTF-8 it will encode it to UTF-8.

He suggested I try decoding my string before returning it to Dancer, which worked very well.

 

We ended up wondering why Dancer failed to detect my string was already utf-8 encoded. 

I got the string from a MongoDB query, and then used XML::LibXML to create a sitemap with it.

 

I tried to reproduce it, but found that if I declare the string in my Perl code everything works, so it's probably related to the MongoDB query (perhaps Mongo returns just the bytes, so the string wasn't marked as UTF-8 and then Dancer failed to detect that it was already encoded).
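A minimal sketch of the double-encoding scenario described above, using only the core Encode module. The `send_response` sub is a hypothetical stand-in for a framework output step that encodes whatever it is handed; it is not Dancer's actual code, and the octets here are produced locally rather than fetched from MongoDB.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The text as a character string (the same code points Data::Dumper showed).
my $chars = "\x{5d4}\x{5e6}\x{5d2}\x{5ea}-\x{5de}\x{5e4}";

# Assumption: the data arrives as raw UTF-8 octets, not as characters.
my $octets = encode('UTF-8', $chars);

# Hypothetical output step that always encodes its input to UTF-8.
sub send_response {
    my ($body) = @_;
    return encode('UTF-8', $body);
}

# Octets passed straight through get encoded a second time.
my $double = send_response($octets);

# Decoding first turns the octets back into characters, so the
# final encode happens exactly once.
my $fixed = send_response(decode('UTF-8', $octets));

printf "single: %d octets, double: %d octets, fixed: %d octets\n",
    length $octets, length $double, length $fixed;
```

If the double-encoded result is longer than the original octets while the decoded-first result matches them, that is consistent with the encode -> decode -> encode chain mentioned in this thread.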

 

Around this step I was happy to have a working sitemap.xml for my website (mobileweb.ynonperek.com/sitemap.xml) and moved on :)

 

Cheers,

  Ynon

 

 

 

 

On 12 October 2012 09:10, Gaal Yahas <gaal at forum2.org> wrote:

Hold on. The string you already had, the dump of which you gave us, was already okay, or close enough to it. What happens if you try just printing it (not with Data::Dumper)?

I'm asking because I don't see any UTF-8 specifically, I just see a bunch of code points. The string is "הצגת-מפ", which you can easily see by looking up some characters in a Unicode table. You didn't show us any evidence of UTF-8 overencoding; if there was some, we'd be seeing the values 0xd7 0x94 etc. (the UTF-8 encoding of the abstract code point U+05d4).
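To make the difference between code points and UTF-8 octets concrete, here is a small sketch that takes the string from the original dump and prints both views. It uses only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string: six Hebrew code points and an ASCII hyphen.
my $text = "\x{5d4}\x{5e6}\x{5d2}\x{5ea}-\x{5de}\x{5e4}";

# Encoding turns the code points into UTF-8 octets.
my $octets = encode('UTF-8', $text);

# Hex view of each form.
my $code_points = join ' ', map { sprintf 'U+%04X', ord } split //, $text;
my $utf8_hex    = join ' ', map { sprintf '%02x',   ord } split //, $octets;

printf "characters: %d, octets: %d\n", length $text, length $octets;
print "$code_points\n";
print "$utf8_hex\n";
```

The first hex line lists abstract code points such as U+05D4; the second starts with d7 94, which is what UTF-8-encoded data would look like in a dump.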

 

I think it's Dumper that was escaping things because it wasn't sure your terminal could display them or whatever. Just try "print $buf".
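A short sketch of that comparison, assuming a terminal that displays UTF-8: Data::Dumper renders characters above 0xFF as \x{...} escapes, while a plain print through an encoding layer emits the letters themselves.

```perl
use strict;
use warnings;
use Data::Dumper;

my $buf = "\x{5d4}\x{5e6}\x{5d2}\x{5ea}-\x{5de}\x{5e4}";

# Terse drops the "$VAR1 = " prefix; the escapes themselves are unaffected.
local $Data::Dumper::Terse = 1;
my $dumped = Dumper($buf);
print $dumped;                      # ASCII-only, with \x{...} escapes

# An output layer encodes the characters to UTF-8 on the way out,
# which also avoids a "Wide character in print" warning.
binmode STDOUT, ':encoding(UTF-8)';
print "$buf\n";
```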

 

 

On Fri, Oct 12, 2012 at 12:40 AM, ynon perek <ynonperek at gmail.com> wrote:

Hi All,

Thanks for all the help. 

 

The problem was in fact the opposite: double encoding (it turned out both XML::LibXML and Dancer encode to UTF-8...)

 

I ended up using decode('utf-8') on the data before passing it on, and this solved the issue (so now I have encode -> decode -> encode chain... which is why abstractions are evil).

 

Have a great weekend, 

  Ynon

 

On 11 October 2012 18:49, Meir Guttman <meir at guttman.co.il> wrote:

Hey Gaal,

I would look up Data::Dumper::AutoEncode (http://search.cpan.org/~bayashi/Data-Dumper-AutoEncode-0.102/lib/Data/Dumper/AutoEncode.pm). You can then use ‘eDumper’ rather than Dumper to actually see letters. This package also enables you to use any encoding you want. (The default, though, is utf8.)

Meir

 

From: perl-bounces at perl.org.il [mailto:perl-bounces at perl.org.il] On Behalf Of Gaal Yahas
Sent: Thursday, 11 October 2012 17:03
To: Perl in Israel
Subject: Re: [Israel.pm] Encoding Question

 

U+05d4 is HEBREW LETTER HE etc. -- your buffer is already in Unicode.

On Thu, Oct 11, 2012 at 4:51 PM, ynon perek <ynonperek at gmail.com> wrote:

Hi All,

 

Quick encoding question: I have a text string that I think is in cp1255, because when I print it with Data::Dumper I get:

 

\x{5d4}\x{5e6}\x{5d2}\x{5ea}-\x{5de}\x{5e4}




But, when I try to decode it using:

 

my $decoded = decode('CP1255', $text);

 

I get this error:

 
 
Wide character in subroutine entry at /Users/ynonperek/perl5/perlbrew/perls/perl-5.14.2/lib/5.14.2/darwin-2level/Encode.pm line 174, <DATA> line 16.

Ideas ?
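A minimal sketch of why this call fails, using the core Encode module: decode() expects octets, and a string that already holds code points above 0xFF cannot be treated as octets. The exact error text depends on the Encode version, so the sketch only checks that the call dies. The round trip at the end shows the direction that does work for such a string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Already a character string, as the \x{...} dump suggests.
my $text = "\x{5d4}\x{5e6}\x{5d2}\x{5ea}-\x{5de}\x{5e4}";

# decode() wants octets; wide characters make it die.
my $decoded = eval { decode('cp1255', $text) };
my $failed  = defined $decoded ? 0 : 1;
print "decode failed: $@" if $failed;

# Characters -> cp1255 octets -> characters again.
my $cp1255 = encode('cp1255', $text);
my $back   = decode('cp1255', $cp1255);
print "round trip ok\n" if $back eq $text;
```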

 

-- 


Writing talks? Speaking in front of an audience? My blog <http://publicspeakr.blogspot.com/> "Learning to Speak" is written especially for you.

 


_______________________________________________
Perl mailing list
Perl at perl.org.il
http://mail.perl.org.il/mailman/listinfo/perl





 

-- 
Gaal Yahas <gaal at forum2.org>
http://gaal.livejournal.com/

