[Israel.pm] binary vectors representation

Mikhael Goikhman migo at homemail.com
Fri Jun 11 06:54:41 PDT 2004


On 11 Jun 2004 16:21:23 +0300, Shlomo Yona wrote:
> 
> On Fri, 11 Jun 2004, Mikhael Goikhman wrote:
> 
> > I suggest working with groups of 32 bits; this is well supported in Perl.
> 
> Thanks Mikhael.
> I think I need something similar but a bit different.

OK, but you could at least run these converters (with the typos fixed) to
see what they do. They compress your data by a factor of 8 and remove the
newlines.

> > So, to convert your zero-one files to binary files do the following:
> 
> I don't want to convert my files.
> I want to convert my scalars to bit vectors, store them in
> files and then be able to later read bit vectors from files
> (hoping that the bit vector storing is more compact than
> storing them as strings...) and do my processing on them.
> 
> > 	open(ZERONE, "<current-format.txt") or die $!;
> > 	open(BINARY, ">new-format.bin") or die $!;
> > 	binmode(BINARY);
> > 
> > 	while (my $line = <ZERONE>) {
> > 		while (my $substr32 = substr($line, 0, 32, "")) {
> 
> Why truncate 32 characters before even looking at them?

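Nothing is lost: the four-argument substr returns the first 32 characters
and removes them from $line in the same step. The chunk is 32 characters
because an unsigned long is 32 bits.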
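
To see what the four-argument substr does on its own, here is a minimal
sketch. The helper name chunks32 is made up for this example, and testing
length($line) instead of the chunk itself is my own variation on the loop:

```perl
use strict;
use warnings;

# Hypothetical helper: split a string of 0/1 characters into chunks of
# at most 32 characters each.
sub chunks32 {
    my ($line) = @_;
    chomp $line;                       # drop the trailing newline, if any
    my @chunks;
    # Test the remaining length rather than the chunk: a final chunk of
    # "0" is false in Perl and would end the loop one chunk too early.
    while (length $line) {
        # 4-argument substr returns the first 32 characters and replaces
        # them in $line with "", so $line shrinks on every pass.
        push @chunks, substr($line, 0, 32, "");
    }
    return @chunks;
}

my @chunks = chunks32("1" x 32 . "0" x 32 . "101\n");
print scalar(@chunks), " chunks, last is '$chunks[-1]'\n";
# prints: 3 chunks, last is '101'
```

Note that a final chunk shorter than 32 characters, like '101' here, would
need padding before packing if the vector has to round-trip exactly.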

> > 			my $number32 = oct("0b$substr32");
> 
> I'm now confused...
> What does oct do? Does it convert base 10, or hex or binary
> to base 8?
> Why?

Please read "perldoc -f oct"; it is all answered there. With a "0b" prefix,
oct parses the string as binary and returns an ordinary number.
We don't work with bases other than 2 anywhere.
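
A small sketch of that behaviour, using made-up values:

```perl
use strict;
use warnings;

my $bits = "00000000000000000000000000000101";   # 32 characters of 0/1
my $n    = oct("0b$bits");     # the "0b" prefix makes oct parse base 2
print "$n\n";                  # prints 5

# sprintf with %032b goes the other way and keeps the leading zeros.
my $back = sprintf("%032b", $n);
print $back eq $bits ? "same\n" : "different\n";   # prints same
```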

> > 			my $bytes4 = pack("L", $number32);
> > 			printf BINARY $bytes4;
> 
> unsigned long is always 32 bits, right?

Yes, this is documented in "perldoc -f pack".

And here is one more typo: use print, not printf, here, because the 4
bytes we want to print may happen to contain a sequence such as "%d",
which printf would treat as a format directive.
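
A sketch of how that can go wrong. The value is chosen on purpose so that
its packed bytes spell "%d%d", and sprintf stands in for printf so the
result can be inspected:

```perl
use strict;
use warnings;

# Pick the 32-bit number whose four packed bytes happen to spell "%d%d".
my $number32 = unpack("L", "%d%d");
my $bytes4   = pack("L", $number32);      # round-trips to "%d%d"

# printf/sprintf treat their first argument as a format string, so the
# "%d" directives get expanded (warnings silenced for the demonstration).
my $mangled = do { no warnings; sprintf($bytes4) };

# print (or plain interpolation) leaves the bytes alone.
my $intact = "$bytes4";

printf "sprintf gave %d bytes, print would write %d bytes\n",
    length($mangled), length($intact);
```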

> > 		}
> > 	}
> > 
> > 	close(ZERONE);
> > 	close(BINARY);
> > 
> > Of course there are other ways to do the same. The data is large, so
> > it makes sense to benchmark one Perl way of doing it against another.
> > 
> > To convert in the opposite direction:
> > 
> > 	my $vector_size = 30000;
> > 
> > 	open(ZERONE, ">current-format2.txt") or die $!;
> > 	open(BINARY, "<new-format.bin") or die $!;
> > 	binmode(BINARY);
> > 
> > 	my $num_line_bytes = $vector_size / 8;
> > 	my $num_line_numbers = $num_line_bytes / 4;
> > 	my $bin_line;
> > 	while (sysread(BINARY, $bin_line, $num_line_bytes)) {
> > 		my @numbers32 = unpack("${num_line_numbers}L", $bin_line);
> > 		my @substrs32 = map { sprintf("%032b", $_) } @numbers32;
> > 		my $original_zero_one_line = join('', @substrs32) . "\n";
> > 		print ZERONE $original_zero_one_line;
> > 	}
> > 
> > 	close(ZERONE);
> > 	close(BINARY);
> > 
> > You should know what a byte is (8 bits) and what an unsigned long is
> > in Perl (32 bits). There is also an endianness problem on some rare
> > architectures, but hopefully you don't need to support them.
> 
> I'm not sure I understand the solution.
> 
> Moreover, I'm now curious whether a bit vector
> representation will actually be more compact.

Just to be clear, this solution does not store the integers as base-10
text; it stores real binary data, so any byte in the file may have any
value from 0 to 255.
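
To make the size difference concrete, here is a minimal sketch with one
32-character chunk. It uses "N" (big-endian 32-bit) instead of "L"; that
swap is my assumption for the example, so the bytes come out the same on
any machine:

```perl
use strict;
use warnings;

my $bits = "11111111000000001010101000000001";    # 32 characters as text

# "N" is a 32-bit unsigned integer in big-endian ("network") order.
my $packed = pack("N", oct("0b$bits"));

my @bytes = unpack("C*", $packed);                # each byte as 0..255
print length($bits), " chars -> ", length($packed), " bytes: @bytes\n";
# prints: 32 chars -> 4 bytes: 255 0 170 1

# And back again; %032b keeps the leading zeros.
my $again = sprintf("%032b", unpack("N", $packed));
print $again eq $bits ? "round trip ok\n" : "mismatch\n";
```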

> Moreover, I wonder if I can do my processing on bits and
> get better memory/speed performance... for example,
> rewriting the algorithm in Classifier.pm to support bit
> vectors instead of arrays of 0/1.

Yes, you may use any of the bitwise operators described in "perldoc perlop",
and they will be fast.
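
For example, here is a sketch with two tiny made-up vectors. It packs with
"B*" (a bit string, high bit first) rather than the 32-bit groups above,
which is just one possible choice:

```perl
use strict;
use warnings;

# Pack two 0/1 strings into bit strings.
my $vec1 = pack("B*", "11001100");
my $vec2 = pack("B*", "10101010");

# When both operands are strings, & | ^ operate on all their bits at once.
my $and = $vec1 & $vec2;
my $xor = $vec1 ^ $vec2;

print unpack("B*", $and), "\n";        # prints 10001000
print unpack("B*", $xor), "\n";        # prints 01100110

# "%32b*" asks unpack for a checksum over the bits, which is simply the
# number of 1 bits in the vector.
print unpack("%32b*", $and), "\n";     # prints 2
```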

You may try to find a module that does this for you. It seems easy enough
to me, but if you know of a good, well-tested module, use it.

What project are we talking about? Is it commercial?

Regards,
Mikhael.

-- 
perl -e 'print+chr(64+hex)for+split//,d9b815c07f9b8d1e'


