[Israel.pm] binary vectors representation

Mikhael Goikhman migo at homemail.com
Fri Jun 11 05:02:30 PDT 2004


On 11 Jun 2004 13:48:05 +0300, Shlomo Yona wrote:
> 
> I have strings in @strings, each is a scalar of length
> $vector_size (vector size is a fixed value determined up
> front and is normally somewhere between 30000-50000).
> 
> I'm storing the strings in a text file, one string per line.
> This means that each line is of length $vector_size+1 (the
> extra character is due to the newline character).
> 
> The characters making the strings are only '0' and '1'. This
> makes the strings "binary vectors".

I suggest working with groups of 32 bits; this is well supported in Perl.

So, to convert your zero-one files to binary files do the following:

	open(ZERONE, "<current-format.txt") or die $!;
	open(BINARY, ">new-format.bin") or die $!;
	binmode(BINARY);

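	while (my $line = <ZERONE>) {
		chomp($line);  # the newline is not part of the vector
		while (length($line)) {
			my $substr32 = substr($line, 0, 32, "");
			# pad a short last group with zeros on the right
			$substr32 .= "0" x (32 - length($substr32));
			my $number32 = oct("0b$substr32");
			my $bytes4 = pack("L", $number32);
			# print, not printf: the bytes may contain "%"
			print BINARY $bytes4;
		}
	}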

	close(ZERONE);
	close(BINARY);

Of course there are other ways to do the same. Since the data is large,
it makes sense to benchmark the different Perl ways of doing it.
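
One such alternative, as a minimal sketch: pack and unpack also have a
"B" template that converts a string of '0' and '1' characters to bytes
directly, without going through 32-bit numbers. The helper names
bits_to_bytes and bytes_to_bits are made up for this illustration, and I
have not benchmarked it against the loop above.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of any module.
# "B*" reads the string as bits, highest bit of each byte first.
sub bits_to_bytes {
	my ($bits) = @_;
	return pack("B*", $bits);   # a short last byte is padded with zero bits
}

sub bytes_to_bits {
	my ($bytes, $vector_size) = @_;
	return unpack("B$vector_size", $bytes);   # only the first $vector_size bits
}

my $bits  = "1011000000001";            # 13 bits, stored in 2 bytes
my $bytes = bits_to_bytes($bits);
print length($bytes), "\n";
print bytes_to_bits($bytes, length($bits)), "\n";
```

With this layout a line of $vector_size bits takes int(($vector_size + 7) / 8)
bytes, and byte order does not matter because nothing is wider than a byte.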

To convert in the opposite direction you may read 4 bytes and then
convert them back to the original 32-bit substring, like this:

	my $number32 = unpack("L", $bytes4);
	my $substr32 = sprintf("%032b", $number32);
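
A small round trip of one group, as a sketch with a made-up sample value,
shows why the format needs a width and zero padding: a plain "%b" drops
the leading zeros, so the substring would come back shorter than 32
characters.

```perl
use strict;
use warnings;

# Sample group chosen for illustration: 29 zeros followed by "101".
my $substr32 = ("0" x 29) . "101";
my $bytes4   = pack("L", oct("0b$substr32"));   # 4 bytes
my $number32 = unpack("L", $bytes4);             # the number 5 again

my $short = sprintf("%b",    $number32);   # leading zeros are lost
my $full  = sprintf("%032b", $number32);   # padded back to 32 characters

print length($short), " characters: $short\n";
print length($full),  " characters: $full\n";
```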

Or even read the entire $vector_size line at once:

	my $vector_size = 30000;

	open(ZERONE, ">current-format2.txt") or die $!;
	open(BINARY, "<new-format.bin") or die $!;
	binmode(BINARY);

	# each line is stored as whole 32-bit numbers, so round up
	my $num_line_numbers = int(($vector_size + 31) / 32);
	my $num_line_bytes = $num_line_numbers * 4;
	my $bin_line;
	while (sysread(BINARY, $bin_line, $num_line_bytes)) {
		# the repeat count goes after the template letter
		my @numbers32 = unpack("L$num_line_numbers", $bin_line);
		# %032b keeps the leading zeros of each group
		my @substrs32 = map { sprintf("%032b", $_) } @numbers32;
		# drop any zero padding after the last real bit
		my $original_zero_one_line =
			substr(join('', @substrs32), 0, $vector_size) . "\n";
		print ZERONE $original_zero_one_line;
	}

	close(ZERONE);
	close(BINARY);

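You should know what a byte (8 bits) and an unsigned long in Perl's pack
(32 bits) are. There is also an endianness problem: pack("L") uses the
byte order of the machine it runs on, so a file written on one
architecture may be read back wrongly on another with a different byte
order. Hopefully you don't need to support that; if you do, "N"
(big-endian) or "V" (little-endian) in place of "L" fixes the byte order.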
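
To see the byte order issue, here is a short sketch that packs the same
number with the three templates and prints the resulting bytes in hex.
The value 0x01020304 is just a convenient sample.

```perl
use strict;
use warnings;

my $number32 = 0x01020304;

# "N" is always big-endian, "V" is always little-endian,
# "L" follows the machine the script runs on.
my $big    = unpack("H*", pack("N", $number32));
my $little = unpack("H*", pack("V", $number32));
my $native = unpack("H*", pack("L", $number32));

print "N: $big\n";
print "V: $little\n";
print "L: $native\n";
```

If the "L" line matches the "V" line, the machine is little-endian. A file
written with "N" or "V" reads back the same on any machine, as long as the
same template is used for unpack.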

Regards,
Mikhael.

-- 
perl -e 'print+chr(64+hex)for+split//,d9b815c07f9b8d1e'


