[Israel.pm] Pesach cleanup

Gabor Szabo gabor at perl.org.il
Mon Apr 12 17:48:53 PDT 2004



Pesach is the major cleaning event of the year, so I too decided
to do some cleaning. I had a lot to do: I was receiving about
1000-1500 emails a day, 80-90% of them plain SPAM.

For a long time I did not use any preventative measures
as I was afraid I would lose important e-mails.
Now I said, enough. I have to give it a try at least.

So I installed Mail::SpamAssassin on my server, which took about
10 minutes, mostly CPAN doing its job. I had to read a bit of the
documentation for the manual part of it, which took an additional
20 minutes, including the part about procmail.

After the first day it had filtered out about 30% of my messages
into my SPAM folder. Nice, but far from my expectations.


So I started to teach the Bayesian database what is SPAM and what
is (kosher) HAM. In practice this means passing each message
through a program and telling that program whether the
message is --spam or --ham.
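As a rough illustration of what "teaching" a Bayesian filter means,
here is a minimal sketch of a toy naive Bayes classifier in Python.
It is not SpamAssassin's actual algorithm or API: the class name
ToyBayesFilter, its methods, and the sample messages are all
hypothetical, and real filters use far more careful tokenising and
scoring.

```python
import math
from collections import Counter


class ToyBayesFilter:
    """Toy naive Bayes classifier (illustration only, not SpamAssassin)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Record the words of one message under 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Return the estimated probability that the text is spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior, so an untrained class never gives log(0).
            score = math.log(
                (self.message_counts[label] + 1) / (total_messages + 2)
            )
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Add-one smoothing; the extra +1 leaves room for unseen words.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Normalise the two log scores into a probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


# Hypothetical training data: one message of each kind.
toy_filter = ToyBayesFilter()
toy_filter.learn("cheap pills buy now", "spam")
toy_filter.learn("meeting agenda for the perl mongers", "ham")

print(toy_filter.spam_probability("buy cheap pills"))
print(toy_filter.spam_probability("perl mongers meeting"))
```

Every message passed to learn() shifts the word counts, which is why
feeding the filter more correctly labelled mail tends to improve its
guesses.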

After doing this with about 1000-2000 messages (don't worry, most
of this was already automated :) I have now reached a point
where about 96-98% of the SPAM is correctly recognised as spam
and thus moved to another folder automatically.

It still leaves me with 20-40 spams a day, but that is already
quite manageable.
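The figures in this post can be cross-checked with a little
arithmetic. The sketch below assumes the ranges quoted above
(1000-1500 mails a day, 80-90% of them spam, 96-98% of the spam
caught); the function name remaining_spam is only illustrative.

```python
def remaining_spam(mails_per_day, spam_share, catch_rate):
    """Spam messages per day that slip past the filter."""
    return mails_per_day * spam_share * (1 - catch_rate)


# Best case: fewest mails, lowest spam share, highest catch rate.
low = remaining_spam(1000, 0.80, 0.98)
# Worst case: most mails, highest spam share, lowest catch rate.
high = remaining_spam(1500, 0.90, 0.96)

print(f"{low:.0f} to {high:.0f} spam messages a day reach the inbox")
```

This gives roughly 16 to 54 a day, so the observed 20-40 sits
inside the expected range.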

Of course what I was afraid of, that the filter would mark good
messages as spam (false positives), also happened. In the
first few days, after going through about 5000 messages marked as
SPAM, I found about 10 which were real messages. On one hand it worries
me a bit, as I do lose valuable messages. On the other hand I
know that earlier, when I had to go through all the messages and
manually decide for each one whether it was SPAM or not, even then I
managed to delete a few legitimate messages. So maybe this filter is
not much worse. In addition, I told the Bayesian engine that these are
good messages (HAMs), and in the last 2 days I have not seen any false
positives. So maybe after some more training the engine will improve
even further.
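To put the false positives in proportion, here is a small
calculation using the approximate counts from the paragraph above
(about 10 real messages among about 5000 flagged as spam). The
variable names are only illustrative.

```python
flagged_as_spam = 5000   # messages the filter moved to the SPAM folder
false_positives = 10     # real messages found among them

# Share of flagged messages that were actually legitimate.
false_positive_share = false_positives / flagged_as_spam
one_in = flagged_as_spam // false_positives

print(f"{false_positive_share:.1%} of flagged mail was legitimate "
      f"(about 1 in {one_in})")
```

That is about 0.2% of the flagged mail, or roughly one legitimate
message in every 500 moved to the SPAM folder.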


Anyway, I'll have to see how it works in the next few weeks but
so far I am very satisfied with my cleaning efforts.

You can read an article about this module on Perl.com:
http://www.perl.com/pub/a/2002/03/06/spam.html


Gabor