[Israel.pm] Re: optimising memory usage

Yosef Meller mellerf at netvision.net.il
Mon Jan 5 03:55:13 PST 2004

> After spending about 2 hours searching for ready-made
> solutions in Perl and realizing that the available stuff does
> not meet my needs, or perhaps does but I can't seem
> to figure out how... I decided to commit the horrible sin of
> making my own version of it. Just to understand how

What horrible sin? What fun is programming if you can't re-invent the
wheel sometimes? Or perhaps the only viable form of programming is doing
the undone and producing the first truly artificially intelligent
computer :) ?

> The reason I am bothering you with this code is that my
> input text is rather big: about 280MB of text. 
> I run the following one-liner:
> find archive/tknz/ -type f -name "*.tknz" -exec cat {} \; | bin/ngram.pl
> which basically takes all the text files located in/under some
> directory tree and shoves them through a pipe into the
> standard input of my script...

Is it important for all the input files to be concatenated? If not, you
can just fetch the file names and then process each file on its own.
If it is, you can still process them one at a time and then sum the
results from all files into one (though then sequences at the end of one
file will not be joined to the start of the next).
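If you want to go that route, a rough sketch might look like this (the
directory name and the counting hook are placeholders for your own code):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Collect the *.tknz files under a directory (sorted so runs are
# reproducible).
sub tknz_files {
	my ($dir) = @_;
	my @files;
	find(sub { push @files, $File::Find::name if /\.tknz$/ }, $dir);
	return sort @files;
}

# Process one file at a time instead of one huge concatenated stream.
my $root = shift(@ARGV) || 'archive/tknz';
if (-d $root) {
	for my $file (tknz_files($root)) {
		open my $fh, '<', $file or die "can't open $file: $!";
		while (my $line = <$fh>) {
			# ... feed $line to the n-gram counter here,
			# then merge this file's counts into running totals ...
		}
		close $fh;
	}
}
```

This way only one file's worth of text is in flight at any moment,
instead of the whole 280MB stream.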

> And I run it on a powerful server with 515920K of memory plus 408008K of swap space.
> Once the 'find' finishes reading the text and the perl
> script is left to run alone, it comes to a point where the
> swap space is reduced to 0 and the available RAM is less
> than 10%. At that point a series of memory allocation errors
> (probably requests for a large chunk of memory which isn't
> available due to fragmentation) cause the script to be
> killed by the operating system. Another possibility is that
> the script was asking for more memory than it was allowed
> to.
> I wonder if you guys see any room for doing anything which
> can contribute to reducing the memory used by this script
> hoping that it will enable it to live long enough to
> actually finish all the loops.

Oooh! A chance for some low level programming in perl, fun!

Inspired by the code of Acme::Bleach, I found a way to reduce the size
of your hash keys by packing them, with single separator bits in place
of whole space characters.

The script below adds a set bit at the beginning of every character but
the first in a token, and an unset bit at the start of every token but
the first. This results in a great saving for one-character tokens, and
diminishing returns for longer tokens - actually being worse for tokens
longer than 8 characters - but how many words are that long?

A couple more gotchas: this only applies to ASCII-derived encodings, and
it is probably very CPU intensive.
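If you want to see where the break-even point sits before committing,
here is a quick byte-counting sketch (my own helper names; it assumes
the plain keys join tokens with single spaces, like the record format
further down):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(ceil);

# Packed size: 8 data bits plus 1 marker bit per character, the very
# last marker chopped off, then padded up to a whole byte.
sub packed_bytes {
	my $chars = 0;
	$chars += length for @_;
	return ceil((9 * $chars - 1) / 8);
}

# Plain size: the characters themselves plus one space between tokens.
sub plain_bytes {
	my $chars = 0;
	$chars += length for @_;
	return $chars + @_ - 1;
}

for my $case (['I', 'a', 'n', 'a', 'm'],
	      ['I', 'am', 'not', 'afraid', 'man']) {
	printf "plain %2d bytes, packed %2d bytes\n",
		plain_bytes(@$case), packed_bytes(@$case);
}
```

For the best-case list this gives 9 plain vs 6 packed bytes; for the
average case, 19 vs 17.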

# pick one of these test cases:
# best case - saves 3 bytes.
my @toks = ('I', 'a', 'n', 'a', 'm');
# avg. case - saves 2 bytes.
#my @toks = ('I', 'am', 'not', 'afraid', 'man');
# worst case - costs 4 bytes.
#my @toks = ('International', 'association', 'of_notoriously',
#	'associated', 'meritocrats');

sub convtok {
	return (join '1',
		(map {unpack('b*', $_)}
			split(//, shift) )) . '0';
}

my $key = join "", map {convtok $_} @toks;
chop $key;
$key .= '0' x (8-(length($key) % 8)) if (length($key) % 8);
$key = pack "b*", $key;

# use this to see how much you saved
print length($key), "\n -- $key\n";

# Now let's get back something printable:
my $back = unpack 'b*', $key;
sub deconv {
	my $toke = '';
	my ($str, $curroffs) = @_;
	do {
		$toke .= pack "b*", substr($str, $curroffs, 8);	# 8 data bits
		$curroffs += 9;					# skip the marker bit too
	} while (substr($str, $curroffs-1, 1));	# '1' means another character follows
	return ($toke, $curroffs);
}
my @new_words;
my $offs = 0;
while (length($back) - $offs >= 8) {	# fewer than 8 bits left is just zero padding
	my $deconved;
	($deconved, $offs) = deconv($back, $offs);
	push @new_words, $deconved;
}

# Use this to write to files:
my $record_name = join ' ', @new_words;
print "rec -- $record_name\n";
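And as a standalone sanity check of the whole round trip (convtok and
deconv copied here, slightly condensed, so it runs by itself; the loop
guard keeps the zero padding at the end from being mistaken for an
extra token):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pack a token into bits: 8 data bits per character, '1' markers
# between characters, a trailing '0' marker per token.
sub convtok {
	my $bits = join '1', map { unpack 'b*', $_ } split //, shift;
	return $bits . '0';
}

# Read one token back out, returning it with the next bit offset.
sub deconv {
	my ($str, $curroffs) = @_;
	my $toke = '';
	do {
		$toke .= pack "b*", substr($str, $curroffs, 8);
		$curroffs += 9;
	} while (substr($str, $curroffs - 1, 1));
	return ($toke, $curroffs);
}

my @toks = ('I', 'am', 'not', 'afraid', 'man');

# Pack...
my $key = join '', map { convtok($_) } @toks;
chop $key;
$key .= '0' x (8 - length($key) % 8) if length($key) % 8;
$key = pack 'b*', $key;

# ...and unpack again.
my $back = unpack 'b*', $key;
my @new_words;
my $offs = 0;
while (length($back) - $offs >= 8) {	# fewer than 8 bits left is padding
	my $word;
	($word, $offs) = deconv($back, $offs);
	push @new_words, $word;
}
print "@new_words\n";	# prints: I am not afraid man
```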

> Please note that the run time is not the part which bothers
> me. Anyway -- most of the time consumed by this script is
> due to operating system thrashing while trying desperately to
> swap data between the disk and the RAM... 
> I suppose it should be nice to speed it up -- but memory is
> my biggest concern at the time.

Yeah, count on the above script to cost you a lot of time.

> Thanks.
You're welcome :)

Yosef Meller.

BTW, Re: the recent OSS debate - This script is _free_ of charge (and in
spirit), _opened_ for view and use, and GPL'ed (Copyright by me). As the
creator of this FOSS code I qualify as an open-source advocate, so I
guess I'm a perl programmer && OSS advocate... Use Linux! Use Apache
(bless you)! Use Subversion! use strict;

To verify my electronic signature or send me encrypted mail, use my
public key at:

If you don't know anything about that, try:
http://www.gnupg.org/gph/en/manual.html for a start on encryption and
