[Israel.pm] binary vectors representation
oferk at oren.co.il
Sun Jun 13 00:43:50 PDT 2004
> I have strings in @strings, each is a scalar of length
> $vector_size (vector size is a fixed value determined up
> front and is normally somewhere between 30000 and 50000).
> I'm storing the strings in a text file, one string per line.
> This means that each line is of length $vector_size+1 (the
> extra character is due to the newline character).
> The characters making the strings are only '0' and '1'. This
> makes the strings "binary vectors".
> I'd like to store the vectors more compactly, using bit
> representation. I'd also like to be able to easily convert
> the strings to binary format, storing them in a file and
> later on be able to easily extract them one by one from a
> file and reconstruct them as strings, or as arrays.
> It seems that the tools to use here are 'pack' and 'unpack',
> however, I'm not sure which template to use, and also, I'm
> not sure how to store the binary data to a file and later on
> read it from a file.
> What would be the idioms to use here?
> Shlomo Yona
Some possible approaches to compact and handle your bit vectors:
1. Use bzip2/gzip on your data file. This is the simplest and perhaps easiest option.
2. Use http://search.cpan.org/~amruta/Sparse-0.02/
3. Turn data into a piddle (PDL data structure) and use
(this might be your best bet: it should be simple yet very quick and powerful).
4. If @strings is very large, you might consider using a C library to read
and handle the data, wrapped in Perl using SWIG for example. There are
efficient libraries for specific architectures such as Intel's MKL, or you
can try a non-specific library. If you decide to try this route let me know
and I'll help if I can...
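
As for the pack/unpack route you asked about, here is a minimal sketch of how it might look with the 'b' (bit string) template. The file name vectors.bin and the sample vectors are made up for illustration, and I am assuming every vector has exactly $vector_size characters. Since every packed record then has the same byte length, no newline separator is needed, and a 30000-bit vector takes 3750 bytes instead of 30001.

```perl
use strict;
use warnings;

# Hypothetical sample data: three vectors of equal length.
my $vector_size = 16;
my @strings = ('1010101010101010', '0000111100001111', '1111111100000000');

my $file = 'vectors.bin';    # hypothetical file name

# Each packed record occupies ceil($vector_size / 8) bytes.
my $record_len = int(($vector_size + 7) / 8);

# Write: pack each '0'/'1' string into bits. Records are fixed-length,
# so they are written back to back with no separator.
open(my $out, '>:raw', $file) or die "Cannot write $file: $!";
print {$out} pack("b$vector_size", $_) for @strings;
close($out) or die "Cannot close $file: $!";

# Read the records back one by one and rebuild the strings.
open(my $in, '<:raw', $file) or die "Cannot read $file: $!";
my @restored;
while (read($in, my $buf, $record_len)) {
    die "Truncated record in $file" unless length($buf) == $record_len;
    push @restored, unpack("b$vector_size", $buf);
}

# Random access: jump straight to record number 1 (counting from 0).
seek($in, 1 * $record_len, 0) or die "Cannot seek in $file: $!";
read($in, my $second, $record_len);
my $second_string = unpack("b$vector_size", $second);
close($in);

# A vector as an array of 0/1 values, if that is more convenient.
my @bits = split //, $restored[0];
```

The ':raw' layer keeps Perl from translating any bytes on the way in or out. 'B' instead of 'b' would also work as long as the same template is used for packing and unpacking; the two differ only in the bit order inside each byte.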