[Israel.pm] Detecting Random characters

Shlomo Yona shlomo at cs.haifa.ac.il
Sun Oct 3 06:19:09 PDT 2004


On Sun, 3 Oct 2004, Ronen Angluster wrote:

[...]

> but a way to find out if dsfadffadh at something.com and knowing that the
> "dsfadffadh" part is random, WITHOUT contacting the originating server
> and requesting confirmation about this mailbox, this is something that
> buffles me completly.

I fear that there's no 100% solution to your problem in the general
case.

In the less general case, you can try and "learn" the
properties of the "random" strings. You can do that by
composing a list of such strings (preferably from the SPAM
you got and tagged). Next you need to define a set of
features that you're interested in:
 	length of the string
 	characters used in each position of the string
 	perhaps also the context of each character in each
 	position (0 or more characters behind and/or forward)
 	...
There are many options, and not all of them are useful, by the way.

Once you have a set of features, you can write a feature
extractor that will be used to scan a string and build a
feature vector out of it.
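
A feature extractor along these lines could look like the sketch
below. It is only an illustration: the function name
extract_features, the feature naming scheme and the "context"
parameter are my own assumptions, and it covers just the three
features listed above (length, character per position, and the
preceding characters as context).

```python
from collections import Counter

def extract_features(local_part, context=1):
    """Map a string to a sparse feature vector (feature name -> count).

    Hypothetical helper: the feature names are arbitrary labels.
    """
    s = local_part.lower()
    features = Counter()
    # Feature 1: length of the string.
    features["length=%d" % len(s)] += 1
    for i, ch in enumerate(s):
        # Feature 2: which character is used at each position.
        features["pos%d=%s" % (i, ch)] += 1
        # Feature 3: the character together with `context` characters
        # before it (a character n-gram).
        if i >= context:
            features["ctx=%s" % s[i - context:i + 1]] += 1
    return features

if __name__ == "__main__":
    for name, count in sorted(extract_features("dsfadffadh").items()):
        print(name, count)
```

A sparse mapping is used rather than a fixed-length array because
the set of possible features is not known before the strings have
been scanned.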

The next step is to build feature vectors for all the
"random" strings you have (you'll need many of them, MANY
MANY, in order to learn something meaningful) and then feed
them to some classification algorithm (check out
AI::Categorizer or some of the AI::* modules, or implement
one on your own based on some popular categorization algorithm
such as Perceptron, Winnow, Naive Bayes, memory based... and so on).

After the learning stage you should end up with a classifier
which is supposed to be able to receive as input a feature
vector of a string and output a result ("is valid" or
"random").
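
To make the learning and classification stages concrete, here is a
minimal Perceptron sketch. Everything in it is an assumption made
for illustration: the function names, the choice of character
bigrams as features, and above all the tiny hand-made training
list, which is far too small to learn anything meaningful. Note
that a Perceptron needs examples of both classes, so the training
data has to include valid-looking strings as well as "random" ones.

```python
from collections import Counter

def features(text):
    """Character-bigram counts plus a constant bias feature."""
    s = text.lower()
    feats = Counter(s[i:i + 2] for i in range(len(s) - 1))
    feats["<bias>"] = 1  # six characters long, so it cannot clash with a bigram
    return feats

def score(weights, feats):
    """Dot product of the weight vector and a sparse feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def train(samples, epochs=100):
    """Perceptron training; labels are +1 ("random") and -1 ("valid")."""
    weights = {}
    for _ in range(epochs):
        mistakes = 0
        for text, label in samples:
            feats = features(text)
            if label * score(weights, feats) <= 0:
                # Misclassified: move the weights toward the correct label.
                for name, value in feats.items():
                    weights[name] = weights.get(name, 0.0) + label * value
                mistakes += 1
        if mistakes == 0:
            break  # every training sample is now classified correctly
    return weights

def classify(weights, text):
    return "random" if score(weights, features(text)) > 0 else "valid"

# Made-up toy data, for illustration only.
TOY_SAMPLES = [
    ("dsfadffadh", +1), ("xkqzvbn", +1), ("qwrtplkj", +1), ("zxcvbnmq", +1),
    ("john", -1), ("maria", -1), ("david", -1), ("shlomo", -1),
]

weights = train(TOY_SAMPLES)
for text, _ in TOY_SAMPLES:
    print(text, classify(weights, text))
```

With real data you would hold back part of the tagged strings and
measure how well the classifier does on strings it has not seen
during training.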

Actually, many SPAM filters use such strategies, but they
don't restrict them to the sender's email address (those
are faked anyway; sometimes the address of a legitimate user
is used as the sender). They also use clues from the
headers, the subject and the email's content.

SPAM filtering is a pain and no-one seems to have a silver
bullet for it. Many try, though.

You can get ideas from many projects (a lot of them are open
source). For example: SpamAssassin http://spamassassin.apache.org/

Good luck.

-- 
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/


