[Israel.pm] binary vectors representation

Shlomo Yona shlomo at cs.haifa.ac.il
Fri Jun 11 06:21:23 PDT 2004


On Fri, 11 Jun 2004, Mikhael Goikhman wrote:

> I suggest working with groups of 32 bits; this is well supported in Perl.

Thanks Mikhael.
I think I need something similar but a bit different.

> So, to convert your zero-one files to binary files do the following:

I don't want to convert my files.
I want to convert my scalars to bit vectors, store them in
files, and later read the bit vectors back from those files
for processing (hoping that storing them as bit vectors is
more compact than storing them as strings...).
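To make that goal concrete, here is a minimal sketch of the round trip, assuming the scalar holds a string of '0' and '1' characters. The helper names and the file name are made up for illustration; they do not come from this thread.

```perl
use strict;
use warnings;

# Hypothetical helpers (not from the thread): pack a 0/1 string into
# bytes, 8 bits per byte, and unpack it again given the bit count.
sub bits_to_bytes { my ($bits) = @_;           return pack("B*", $bits); }
sub bytes_to_bits { my ($bytes, $nbits) = @_;  return unpack("B$nbits", $bytes); }

my $vector = "1011000111010001101";      # 19 bits, stored as 19 characters
my $packed = bits_to_bytes($vector);     # padded with zero bits to 3 bytes

my $file = "vector-demo.bin";            # hypothetical file name
open(my $out, ">", $file) or die $!;
binmode($out);
print $out $packed;
close($out);

open(my $in, "<", $file) or die $!;
binmode($in);
my $read_back = do { local $/; <$in> };  # slurp the whole file
close($in);
unlink($file);

# The bit count must be remembered separately, because the padding
# bits in the last byte cannot be told apart from real zeros.
my $restored = bytes_to_bits($read_back, length($vector));
print $restored eq $vector ? "round trip ok\n" : "mismatch\n";
```

The vector length has to be stored or known in advance; a fixed $vector_size for every vector would be one way to handle that.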

> 	open(ZERONE, "<current-format.txt") or die $!;
> 	open(BINARY, ">new-format.bin") or die $!;
> 	binmode(BINARY);
> 
> 	while (my $line = <ZERONE>) {
> 		while (my $substr32 = substr($line, 0, 32, "")) {

Why truncate 32 characters before even looking at them?
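For what it is worth, the four-argument form of substr returns the characters it removes, so nothing is lost before it is examined. A small sketch with a made-up input string:

```perl
use strict;
use warnings;

my $line  = "110010101100";
# 4-argument substr: replace the first 4 characters of $line with ""
# and return the characters that were replaced.
my $chunk = substr($line, 0, 4, "");

print "chunk: $chunk\n";   # the removed characters
print "rest:  $line\n";    # what is left in $line
```

So each pass of the inner loop takes the next chunk off the front of $line and hands it to the loop body.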

> 			my $number32 = oct("0b$substr32");

I'm now confused...
What does oct do? Does it convert base 10, hex, or binary
to base 8?
Why?
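As far as perlfunc describes it, oct goes the other way: it takes a string and returns its numeric value, reading the string as octal by default, as hex with a "0x" prefix, and as binary with a "0b" prefix. A short sketch with made-up values:

```perl
use strict;
use warnings;

my $from_binary = oct("0b1011");   # binary string  -> number
my $from_hex    = oct("0x1F");     # hex string     -> number
my $from_octal  = oct("755");      # octal string   -> number

# sprintf("%b") goes the opposite way: number -> binary string
my $back = sprintf("%b", $from_binary);

print "$from_binary $from_hex $from_octal $back\n";
```

That is why the quoted code prepends "0b" to the 32-character chunk before calling oct.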

> 			my $bytes4 = pack("L", $number32);
> 			print BINARY $bytes4;

An unsigned long is always 32 bits, right?
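According to the pack documentation, the "L" template is exactly 32 bits in Perl, whatever the C compiler calls a long, but its byte order is the machine's native one. A sketch, assuming the files may move between machines, of the fixed-order templates "N" (big-endian) and "V" (little-endian):

```perl
use strict;
use warnings;

my $number32 = 0x01020304;

my $native = pack("L", $number32);   # 4 bytes, native byte order
my $big    = pack("N", $number32);   # 4 bytes, always big-endian
my $little = pack("V", $number32);   # 4 bytes, always little-endian

print length($native), " bytes\n";
print unpack("H*", $big),    "\n";   # hex dump of the big-endian bytes
print unpack("H*", $little), "\n";   # hex dump of the little-endian bytes
```

Writing with "N" and reading with "N" gives the same number back regardless of the architecture.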

> 		}
> 	}
> 
> 	close(ZERONE);
> 	close(BINARY);
> 
> Of course there are other ways to do the same. The data is large, so
> it makes sense to benchmark the different Perl ways of doing it.
> 
> To convert in the opposite direction you may read 4 bytes and then
> convert them to original 32 bit substring, like:
> 
> 	my $number32 = unpack("L", $bytes4);
> 	my $substr32 = sprintf("%032b", $number32);
> 
> Or even read the entire $vector_size line at once:
> 
> 	my $vector_size = 30000;
> 
> 	open(ZERONE, ">current-format2.txt") or die $!;
> 	open(BINARY, "<new-format.bin") or die $!;
> 	binmode(BINARY);
> 
> 	my $num_line_bytes = $vector_size / 8;
> 	my $num_line_numbers = $num_line_bytes / 4;
> 	my $bin_line;
> 	while (sysread(BINARY, $bin_line, $num_line_bytes)) {
> 		my @numbers32 = unpack("${num_line_numbers}L", $bin_line);
> 		my @substrs32 = map { sprintf("%032b", $_) } @numbers32;
> 		my $original_zero_one_line = join('', @substrs32) . "\n";
> 		print ZERONE $original_zero_one_line;
> 	}
> 
> 	close(ZERONE);
> 	close(BINARY);
> 
> You should know what a byte is (8 bits) and what an unsigned long is in
> Perl (32 bits). There is also an endianness problem on some rare
> architectures, but hopefully you don't need to support them.

I'm not sure I understand the solution.

I'm also curious whether a bit vector representation
will actually be more compact.
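One way to check is simply to measure. The sketch below builds a made-up 30000-position vector (the size used in the quoted code) and compares its length as a 0/1 string with its length after pack("B*"):

```perl
use strict;
use warnings;

my $vector_size = 30000;

# Made-up test data: every third position is set.
my $vector = join '', map { $_ % 3 == 0 ? 1 : 0 } 1 .. $vector_size;

my $packed = pack("B*", $vector);    # 8 positions per byte

printf "as 0/1 string: %d bytes\n", length($vector);
printf "as packed bits: %d bytes\n", length($packed);
```

On disk this is an eightfold saving over the character form. The saving in memory compared with a Perl array of 0/1 values should be larger still, since each array element is a full scalar, but I have not measured that.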

Moreover, I wonder if I can do my processing on bits and
get better memory/speed performance... for example,
rewriting the algorithm in Classifier.pm to support bit
vectors instead of arrays of 0/1.
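As a rough idea of what processing on bits could look like (this is not the Classifier.pm algorithm, only an illustration with made-up vectors): Perl's &, | and ^ operators work bytewise when both operands are strings, and unpack with a "%32" checksum prefix counts the set bits.

```perl
use strict;
use warnings;

# Two made-up 8-position vectors, packed into one byte each.
my $x = pack("B*", "11001010");
my $y = pack("B*", "10100110");

my $both   = $x & $y;    # positions set in both vectors
my $differ = $x ^ $y;    # positions where the vectors differ

# "%32B*" sums the individual bits, i.e. counts the 1s.
my $common   = unpack("%32B*", $both);
my $distance = unpack("%32B*", $differ);

print "in common: $common, differing: $distance\n";
print "AND as bits: ", unpack("B8", $both), "\n";
```

Both operands must really be strings (as returned by pack) for the bytewise behaviour; if either has been used as a number, Perl does numeric bitwise arithmetic instead.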

Thanks, Mikhael, I'm waiting for your response.

-- 
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/
