[Israel.pm] optimizing memory usage

Shlomo Yona shlomo at cs.haifa.ac.il
Sun Jan 4 16:04:15 PST 2004


Hello,

The following code is a 10-minute reinvention of the wheel
for simple n-gram analysis (also known as "sliding window
statistics").

After spending about 2 hours searching for ready-made
solutions in Perl, and realizing that the available stuff
does not meet my needs (or that perhaps it does, but I
can't seem to figure out how), I decided to commit the
horrible sin of rolling my own. Just to appreciate how
horrible this sin is: search Google for "Perl ngram" and
see how many implementations are already out there.

Anyway -- due to time limitations, and considering that
writing my own code will most likely give me a better
understanding of how it works (and then I can also be
proud that all the bugs are mine and not someone else's),
I sat down and wrote the following:


=== begin code: ngram.pl ===

#!/usr/bin/perl -w
use strict;
use warnings;

my $files_dir = '/home/shlomo/a7/archive/statistics/';
system("mkdir -p $files_dir") unless -d $files_dir;
my %token_counts=();
# slurp all of standard input into a single scalar
my $text = join '',<>;
$text=~s/^\s*//;
$text=~s/\s*$//;
# collapse internal runs of whitespace into a single space,
# keeping the word boundaries that the words model splits on
$text=~s/\s+/ /gs;
my @window_sizes=(1 .. 5);
my $model= {
	words => {
		sorting => {
			"by_frequency-alphabet.txt" => sub {
				$token_counts{$b} <=> $token_counts{$a} 
				||
				$a cmp $b
			},
			"by_alphabet-frequency.txt" => sub {
				$a cmp $b
				||
				$token_counts{$b} <=> $token_counts{$a}
			},
			"by_token_length-frequency-alphabet.txt" => sub {
				length($b) <=> length($a) 
				||
				$a cmp $b
				||
				$token_counts{$b} <=> $token_counts{$a}
			},
		},
		token_separator => qr/\s+/,
		token_vector_preparation => sub {
			my ($txt,$ws,$ts) = @_;
			# pad both ends with $ws-1 single-space boundary
			# tokens so the window can hang over the text edges
			return (
				split(//,' ' x ($ws-1)),
				split(/$ts/,$txt),
				split(//,' ' x ($ws-1))
			);
		},
	},
	characters => {
		sorting => {
			"by_frequency-alphabet.txt" => sub {
				$token_counts{$b} <=> $token_counts{$a} 
				||
				$a cmp $b
			},
			"by_alphabet-frequency.txt" => sub {
				$a cmp $b
				||
				$token_counts{$b} <=> $token_counts{$a}
			},
		},
		token_separator => qr//, # empty pattern: split into single characters
		token_vector_preparation => sub {
			my ($txt,$ws,$ts) = @_;
			return (
				split(//,' ' x ($ws-1)),
				split(/$ts/,$txt),
				split(//,' ' x ($ws-1))
			);
		},
	},
};

foreach my $window (@window_sizes) {
	while( my ($k,$v) = each %$model) {
		# prepare tokens vector
		my @tokens = $v->{token_vector_preparation}->($text,$window,$v->{token_separator});
		%token_counts=();
		# slide a $window-sized window over the token vector,
		# counting each n-gram it covers
		for (my $i=0; $i<@tokens-$window+1; ++$i) {
			my $from = $i;
			my $to = $i+$window-1;
			++$token_counts{join(' ',@tokens[$from .. $to])};
		}
		# dump the counts once per requested sort order
		while (my ($filename_suffix,$sort_sub) = each %{$v->{sorting}}) {
			my $filename = $files_dir.$k.".".$window."-gram.".$filename_suffix;
			rename($filename, "$filename.bak") if -f $filename;
			open(my $out, '>', $filename) or die "Cannot open $filename for writing: $!\n";
			foreach my $token (sort $sort_sub keys %token_counts) {
				print $out $token,"\t",$token_counts{$token},"\n";
			}
			close($out) or die "Cannot close $filename after writing: $!\n";
		}
	}
}

=== end code ===

What basically goes on here: the script accepts text input,
"splits" the text into units (a unit may be a word or a
character), and then slides a window of a fixed size in
units (in this case the size can be 1, 2, 3, 4 or 5) over
the text. The "stuff" seen in the window is used as a key
which is associated with a counter. As the window slides
over the text, this simple algorithm collects the
statistics, which are then sorted and printed into a file
for later offline inspection.
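
To make the windowing concrete, here is a minimal,
self-contained illustration of the same counting idea (the
toy input and variable names are mine, not taken from the
script above):

=== begin code: window_demo.pl ===

#!/usr/bin/perl
use strict;
use warnings;

# Toy demonstration of sliding-window (n-gram) counting over a token list.
my @tokens = qw(the cat sat on the mat);
my $window = 2; # bigrams

my %counts;
for my $i (0 .. $#tokens - $window + 1) {
	# each window position becomes a single hash key
	my $key = join ' ', @tokens[$i .. $i + $window - 1];
	++$counts{$key};
}

# prints e.g. "cat sat	1", "on the	1", "sat on	1", ...
print "$_\t$counts{$_}\n" for sort keys %counts;

=== end code ===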



The reason I am bothering you with this code is that my
input is rather big: about 280MB of text.

I run the following one-liner:

find archive/tknz/ -type f -name "*.tknz" -exec cat {} \; | bin/ngram.pl

which basically takes all the text files located in or under
some directory tree and shoves them through a pipe into the
standard input of my script...

And I run it on a powerful server with 515920K of memory plus 408008K of swap space.

Once the 'find' finishes reading the text and the Perl
script is left to run alone, it comes to a point where the
swap space is reduced to 0 and the available RAM is less
than 10%. At that point a series of memory allocation errors
(probably requests for a large chunk of memory which isn't
available due to fragmentation) causes the script to be
killed by the operating system. Another possibility is that
the script was asking for more memory than it was allowed
to.

I wonder if you guys see any room for doing anything that
could reduce the memory used by this script, in the hope
that it will live long enough to actually finish all the
loops.

Please note that the run time is not the part that bothers
me. Anyway -- most of the time consumed by this script is
due to operating system thrashing while desperately trying
to swap data between the disk and the RAM...

I suppose it would be nice to speed it up -- but memory is
my biggest concern at the moment.
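
For what it's worth, the direction I tried to think in was
to stream the input instead of slurping it. Below is a
rough, untested sketch for a single model and a single
window size: it reads line by line and keeps only the last
$window tokens in a buffer, so the full 280MB never sits in
memory at once (the n-gram hash itself can of course still
grow large). I haven't figured out how to fit the character
model and the multiple window sizes into one pass, which is
part of why I'm asking:

=== begin code: streaming sketch ===

#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch: stream word 3-grams without slurping the whole input.
my $window = 3;
my (@buffer, %counts);

while (my $line = <>) {
	for my $token (split /\s+/, $line) {
		next unless length $token; # skip empty fields from leading whitespace
		push @buffer, $token;
		shift @buffer while @buffer > $window; # keep only the last $window tokens
		++$counts{join ' ', @buffer} if @buffer == $window;
	}
}

print "$_\t$counts{$_}\n" for sort keys %counts;

=== end code ===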


Thanks.

-- 
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/