[Israel.pm] binary vectors representation
semuel at semuel.co.il
Sun Jun 13 05:59:34 PDT 2004
What about RLE? (Run-Length-Encoding)
Can be a good cast for this.
From: perl-bounces at perl.org.il [mailto:perl-bounces at perl.org.il] On Behalf
Of Shlomo Yona
Sent: Sunday, June 13, 2004 7:04 AM
To: Perl in Israel
Subject: Re: [Israel.pm] binary vectors representation
On Fri, 11 Jun 2004, Mikhael Goikhman wrote:
I'll look into it.
Actually, the more I think of it, the more I'm convinved now
that playing with binary representation isn't such a good
way of "compressing" the data.
A better way for representing sparse binary vectors (vectors
which are very long but most of them are zeros while a very
small part of them is ones), will probably be to simply list
the indices of the items which are nonzero.
Going over my data I realized that vectors length is
30,000-50,000 items long, and every vector has no more than
30 items marked with 1 leaving all the rest marked with 0.
So, I will actually need perl to store about 30-50 numbers
of size 0-50,000. This sounds like a better compression than
1:8. If the scalar representing the index is even 8 byte
long I get
50 indices * 8 bytes each = 400 bytes
50000 indices packed into bytes in bit representation = 50000 / 8 =
so in the worst case I still get over 15 times better
So -- I think I'll now wrap the code that generates these
vectors to store them in file by writing only the nonzero
indices, rather than writing the 1/0 content of every index.
I'll also wrap the code which expects to get vectors by a
code which reads index-representation of a vector and
expands it to a regular vector.
I suppose that if I really want to speed up computation time
I can complicate the calculations part and have it
use the fact that very little indices are nonzero... but
I'll avoid that for now.
> What is the project we speak about, is it commercial?
This is simply an implementation of a feature extractor for
producing feature vectors for the classifier I've already
showed you. As features are very much task and data
depended, the code of a feature extractor is normally not
very generic (unlike the classifier's code).
Nothing is commercial here. I'll gladly share code, data and
results if there's interest. If we end up really off topic
we can take it offlist, although so far I don't think we are
shlomo at cs.haifa.ac.il
Perl mailing list
Perl at perl.org.il
More information about the Perl