[Israel.pm] binary vectors representation

Shlomo Yona shlomo at cs.haifa.ac.il
Sat Jun 12 22:03:56 PDT 2004


On Fri, 11 Jun 2004, Mikhael Goikhman wrote:

[...]

Thanks.
I'll look into it.

Actually, the more I think of it, the more I'm convinved now
that playing with binary representation isn't such a good
way of "compressing" the data.

A better way for representing sparse binary vectors (vectors
which are very long but most of them are zeros while a very
small part of them is ones), will probably be to simply list
the indices of the items which are nonzero.
Going over my data I realized that vectors length is
30,000-50,000 items long, and every vector has no more than
30 items marked with 1 leaving all the rest marked with 0.

So, I will actually need perl to store about 30-50 numbers
of size 0-50,000. This sounds like a better compression than
1:8. If the scalar representing the index is even 8 byte
long I get
	50 indices * 8 bytes each = 400 bytes
vs.
	50000 indices packed into bytes in bit representation = 50000 / 8 = 6250
so in the worst case I still get over 15 times better
compression.

So -- I think I'll now wrap the code that generates these
vectors to store them in file by writing only the nonzero
indices, rather than writing the 1/0 content of every index.
I'll also wrap the code which expects to get vectors by a
code which reads index-representation of a vector and
expands it to a regular vector.

I suppose that if I really want to speed up computation time
I can complicate the calculations part and have it
use the fact that very little indices are nonzero... but
I'll avoid that for now.

> What is the project we speak about, is it commercial?

This is simply an implementation of a feature extractor for
producing feature vectors for the classifier I've already
showed you. As features are very much task and data
depended, the code of a feature extractor is normally not
very generic (unlike the classifier's code).

Nothing is commercial here. I'll gladly share code, data and
results if there's interest. If we end up really off topic
we can take it offlist, although so far I don't think we are
:-)

-- 
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/



More information about the Perl mailing list