[Israel.pm] optimizing memory usage

Shlomo Yona shlomo at cs.haifa.ac.il
Sun Jan 4 18:10:52 PST 2004


Hello,

I just woke up and realized how silly it was to read all
the text at once, when the sliding window never needs more
than one file's worth of text at a time...

So the script is now run from the command line like this:

find archive/tknz/ -type f -name "*.tknz" | bin/ngram.pl

And here is the script; it reads the list of filenames from
standard input. (I also attached it, for anyone who sees it
with broken lines.)

=== begin script (ngram.pl) ===

#!/usr/bin/perl -w
use strict;
use warnings;

use File::Path;	# for mkpath(), instead of shelling out to "mkdir -p"

my $files_dir = '/home/shlomo/a7/archive/statistics/';
mkpath($files_dir) unless -d $files_dir;
my %token_counts=();
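# Each model maps output-file suffixes to sort comparators and
# carries its own tokenizer: "words" splits on whitespace,
# "characters" splits on the empty pattern.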
my $model= {
	words => {
		sorting => {
			"by_frequency-alphabet.txt" => sub {
				$token_counts{$b} <=> $token_counts{$a} 
				||
				$a cmp $b
			},
			"by_alphabet-frequency.txt" => sub {
				$a cmp $b
				||
				$token_counts{$b} <=> $token_counts{$a}
			},
			"by_token_length-frequency-alphabet.txt" => sub {
				length($b) <=> length($a) 
				||
				$a cmp $b
				||
				$token_counts{$b} <=> $token_counts{$a}
			},
		},
		token_separator => qr/\s+/,
		token_vector_preparation => sub {
			my ($txt_ref,$ws,$ts) = @_;
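			# pad with $ws-1 dummy space tokens on each side, so
			# windows overlapping the start and end of the text
			# are counted too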
			return (
				split(//,' ' x ($ws-1)),
				split(/$ts/,$$txt_ref),
				split(//,' ' x ($ws-1))
			);
		},
	},
	characters => {
		sorting => {
			"by_frequency-alphabet.txt" => sub {
				$token_counts{$b} <=> $token_counts{$a} 
				||
				$a cmp $b
			},
			"by_alphabet-frequency.txt" => sub {
				$a cmp $b
				||
				$token_counts{$b} <=> $token_counts{$a}
			},
		},
		token_separator => qr//,
		token_vector_preparation => sub {
			my ($txt_ref,$ws,$ts) = @_;
			return (
				split(//,' ' x ($ws-1)),
				split(/$ts/,$$txt_ref),
				split(//,' ' x ($ws-1))
			);
		},
	},
};

my @window_sizes=(1 .. 5);
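# slurp the list of filenames (one per line) piped in by find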
my @txt_files = <>;
foreach my $window (@window_sizes) {
	while( my ($k,$v) = each %$model) {
		%token_counts=();
		foreach my $txt_filename (@txt_files) {
			chomp $txt_filename;
			open(my $in, '<', $txt_filename) or die "Cannot open $txt_filename for reading: $!\n";
			my $text=join '',<$in>;
			close($in) or die "Cannot close $txt_filename after reading: $!\n";
			$text=~s/^\s*//;
			$text=~s/\s*$//;
			$text=~s/\s+/ /gs;	# collapse whitespace runs to single spaces
			# prepare the padded token vector for this model
			my @tokens = $v->{token_vector_preparation}->(\$text,$window,$v->{token_separator});
			# slide a window of $window tokens along the vector and count
			for (my $i=0; $i<@tokens-$window+1; ++$i) {
				my $from = $i;
				my $to = $i+$window-1;
				++$token_counts{join(' ',@tokens[$from .. $to])};
			}
		}
		while (my ($filename_suffix,$sort_sub) = each %{$v->{sorting}}) {
			my $filename = $files_dir.$k.".".$window."-gram.".$filename_suffix;
			rename($filename,"$filename.bak") if -f $filename;
			open(my $out, '>', $filename) or die "Cannot open $filename for writing: $!\n";
			foreach my $token (sort $sort_sub keys %token_counts) {
				print $out $token,"\t",$token_counts{$token},"\n";
			}
			close($out) or die "Cannot close $filename after writing: $!\n";
		}
		# free the counts only after *all* sort orders have been
		# written; deleting keys while writing the first output file
		# would leave the hash empty for the remaining ones
		%token_counts=();
	}
}

=== end script ===
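
By the way, if the windowing idiom in the inner loop looks
opaque, here is a tiny standalone sketch of what it computes,
with a made-up token list padded the way the script pads for
a window of size 2:

=== begin sketch ===

#!/usr/bin/perl
use strict;
use warnings;

# a made-up, pre-padded token vector and a window of size 2
my @tokens = (' ', qw(the cat sat), ' ');
my $window = 2;

my %counts;
for (my $i=0; $i<@tokens-$window+1; ++$i) {
	++$counts{join(' ',@tokens[$i .. $i+$window-1])};
}
print "$_\t$counts{$_}\n" foreach sort keys %counts;
# prints four bigrams, among them "the cat" and "cat sat",
# each seen once

=== end sketch ===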

One large chunk in memory now is the file list (there are
about 60,000-70,000 files, and their number grows very
slowly, about 10-20 a day) -- so I'm not worried about that
front for now. The largest chunk is the token_counts hash --
but still, according to the memory usage figures, the script
now takes no more than 2%-3% of the system's memory (which
is something I can easily live with).
No more thrashing, no more crashing.
All this from a small reordering of the code and a little
change in the order of operations (counting in between file
reads instead of reading all the files and only then
processing the information).
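
In case the reordering is not obvious from the script above,
this is the pattern it boils down to (a minimal sketch, again
taking the filenames from standard input): fold each file
into the counts as soon as it is read, so only one file's
text is held in memory at a time.

=== begin sketch ===

#!/usr/bin/perl
use strict;
use warnings;

my %counts;
while (my $filename = <STDIN>) {	# one filename per line, as from find
	chomp $filename;
	open(my $in, '<', $filename) or die "Cannot open $filename: $!\n";
	my $text = join '', <$in>;	# only the current file is in memory
	close($in) or die "Cannot close $filename: $!\n";
	++$counts{$_} foreach split /\s+/, $text;
}
# %counts now holds the unigram totals; none of the texts remain

=== end sketch ===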

I'll make sure that the output is to my liking and then I'll
go back to sleep.



-- 
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/