Naive Bayesian filtering is a failure against spam

By Angsuman Chakraborty, Gaea News Network
Monday, March 14, 2005

Every day I receive several thousand spam emails about debt consolidation, mortgages, and enhancement advice or pills for organs I do not possess. I have been offered millions by unknown Nigerians, offered services I do not care about, and sent images bad enough to spoil my day!

I manage to filter most of them daily using SpamBayes, a naive Bayesian filter integrated closely with Outlook.
However, spammers have adapted. They systematically send spam designed to poison the database, and they employ a wide variety of other tricks. As a result, around 100 or so spams land in my inbox every day, and the number is steadily increasing.
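To make the poisoning idea concrete, here is a minimal sketch of a word-count naive Bayes scorer. It is a toy written for illustration, not the SpamBayes implementation; the class name `TinyNaiveBayes`, the training messages, and the smoothing choice are all assumptions. It shows one common form of the trick: padding a spam message with words that usually appear in legitimate mail pulls its score down.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Toy word-count naive Bayes classifier (illustrative, not SpamBayes)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Count the words of one message under the given label."""
        self.words[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_probability(self, text):
        """Return P(spam | text), treating every word as independent."""
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        total_messages = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.words[label].values())
            # Start from the class prior, then add one term per word.
            score = math.log(self.messages[label] / total_messages)
            for word in text.lower().split():
                # Add-one (Laplace) smoothing so unseen words are not fatal.
                count = self.words[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into a probability of spam.
        return 1.0 / (1.0 + math.exp(log_score["ham"] - log_score["spam"]))


nb = TinyNaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("debt consolidation offer now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday with the team", "ham")

plain = "cheap pills offer"
# The same spam, padded with words the filter has only seen in ham.
padded = plain + " meeting agenda monday lunch team"

plain_score = nb.spam_probability(plain)
padded_score = nb.spam_probability(padded)
print(f"plain:  {plain_score:.2f}")
print(f"padded: {padded_score:.2f}")
```

In this toy corpus the padded message scores well below the plain one, which is why a filter that looks only at word statistics can be steered by whoever writes the message.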

The SpamBayes filter is very large, and filtering takes a lot of time. Adding email to the filter also takes time.

A Bayesian filter is not intelligent enough to realize that an email (even a forwarded joke) from a CEO is not spam, especially when he is a close friend (and one who can give me business). The filter is too dumb (naive?) to be a long-term solution.

By and large, naive Bayesian filters have lost against spammers as the sole anti-spam solution. What we need today is an array of filters, conveniently packaged, that can be applied at will to weed out spam. We need to adapt too.
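As a rough sketch of what such an array of filters could look like, the snippet below chains a sender whitelist, a phrase rule, and a statistical score. Everything in it is hypothetical: the addresses, the phrases, the 0.9 threshold, and the stand-in `toy_score` function are placeholders, and a real setup would plug a trained Bayesian scorer in where `toy_score` sits.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]          # assumed shape: sender, subject, body
Verdict = Optional[str]           # "ham", "spam", or None for "no opinion"
Filter = Callable[[Message], Verdict]

TRUSTED_SENDERS = {"ceo@example.com"}                       # hypothetical
BLOCKED_PHRASES = ("debt consolidation", "mortgage offer")  # hypothetical


def whitelist_filter(msg: Message) -> Verdict:
    """Mail from a trusted sender is ham, whatever it contains."""
    return "ham" if msg["sender"].lower() in TRUSTED_SENDERS else None


def phrase_filter(msg: Message) -> Verdict:
    """Flag mail that contains a known spam phrase."""
    text = (msg["subject"] + " " + msg["body"]).lower()
    return "spam" if any(p in text for p in BLOCKED_PHRASES) else None


def make_score_filter(score_fn: Callable[[str], float],
                      threshold: float = 0.9) -> Filter:
    """Wrap a statistical scorer (for example a Bayesian one) as a filter."""
    def score_filter(msg: Message) -> Verdict:
        return "spam" if score_fn(msg["body"]) >= threshold else None
    return score_filter


def classify(msg: Message, filters: List[Filter], default: str = "ham") -> str:
    """Ask each filter in order; the first one with an opinion wins."""
    for check in filters:
        verdict = check(msg)
        if verdict is not None:
            return verdict
    return default


def toy_score(text: str) -> float:
    """Stand-in for a trained classifier's spam probability."""
    return 0.95 if "pills" in text.lower() else 0.1


FILTERS = [whitelist_filter, phrase_filter, make_score_filter(toy_score)]

joke = {"sender": "ceo@example.com", "subject": "Fwd: joke",
        "body": "cheap pills, ha ha"}
stranger = {"sender": "unknown@example.net", "subject": "offer",
            "body": "cheap pills"}

print(classify(joke, FILTERS))      # whitelist decides before the scorer runs
print(classify(stranger, FILTERS))  # falls through to the statistical score
```

Putting the whitelist first is the design choice that matters here: the forwarded joke from a known sender never reaches the statistical filter, so it cannot be misjudged by it.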

Filed under: Spam Watch, Web

Discussion

achmad
May 23, 2008: 10:48 pm

How can I get the code of the SpamBayes algorithms?

August 13, 2005: 8:03 pm

@Seth
Thanks for the valuable insight about using only the last month’s email in the corpus.

August 13, 2005: 12:07 am

My SpamBayes setup runs through procmail, which autofilters into Maildir folders for mutt, and it is retrained every night at 2 AM on viewed hams and spams (unviewed ones are left out) that have been touched in the last month (old messages are actually autoarchived after a month). I set the spam threshold to 0.01% spam, and mutt is able to display the individual spam score in the spam folder index.

One in maybe five thousand spams gets through, and I get 9 times as much ham as spam. On average, 1400 spam messages per month get filtered. I do have a percentage of false positives due to the low threshold; however, 90% of the time I don’t want to see them anyway. The rest are from corporatey salespeople (fewer than a dozen a month) whom I mostly expect messages from already, which I suspect training on my sent mail would help fix.

I really don’t see naive Bayesian filters as a failure. Training on only the last month’s email really helps, as the spam corpus changes frequently, and retraining nightly helps spot spam evolution by learning the latest 80-90% spam scorers (instead of the 100% scorers, which 80% of spam registers as).
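The selection step of that nightly retraining can be sketched roughly as follows. This is an illustration of the idea only, not the procmail or SpamBayes configuration itself; the `StoredMessage` record, its field names, and `select_training_set` are hypothetical, and the 30-day window stands in for "the last month".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class StoredMessage:
    """Hypothetical record for one message kept in a mail folder."""
    text: str
    label: str              # "ham" or "spam"
    last_touched: datetime  # when the message was last read or moved
    viewed: bool            # unviewed messages are left out of training


def select_training_set(messages: List[StoredMessage], now: datetime,
                        window_days: int = 30) -> List[StoredMessage]:
    """Keep only viewed messages touched within the sliding window."""
    cutoff = now - timedelta(days=window_days)
    return [m for m in messages if m.viewed and m.last_touched >= cutoff]


now = datetime(2005, 8, 13, 2, 0)
mailbox = [
    StoredMessage("cheap pills", "spam", datetime(2005, 8, 1), viewed=True),
    StoredMessage("old mortgage offer", "spam", datetime(2005, 6, 1), viewed=True),
    StoredMessage("unread newsletter", "ham", datetime(2005, 8, 10), viewed=False),
    StoredMessage("meeting notes", "ham", datetime(2005, 8, 12), viewed=True),
]

training_set = select_training_set(mailbox, now)
for message in training_set:
    print(message.label, message.text)
```

The classifier would then be rebuilt from scratch on `training_set` each night, so words that only appeared in older spam stop influencing the scores.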

March 19, 2005: 10:45 pm

I will definitely try that. I have now reinstalled SpamBayes as a POP3 filter and am retraining it from scratch. Let’s see if it does better than last time.

Already it is complaining that I have too high a spam-to-ham ratio (10:1) and that SpamBayes doesn’t give good results in this scenario.

March 19, 2005: 1:46 am

You should all try POPFile, available freely, including Perl source code, at SourceForge.net. It will run on Windows, Linux, and maybe many other OSes that have a Perl interpreter available. It has POP3, SMTP, NNTP and even IMAP support, and a nice web interface for configuration and training.

I don’t know if it will scale seamlessly to handle 5000 spams daily, but since it’s open source and supports MySQL databases, there is a good chance that such an amount won’t be a problem at all, at least after making some modifications to it.

Personally, I’ve been using this piece of cake for about three months now, and while receiving about 2500-3000 spams in a period of 30 days, it achieves an accuracy of 99.58%. This means only 12 spams got through and there was only 1 false positive, while 2,832 spams were blocked successfully.
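For reference, the quoted percentage matches the share of spam that was blocked, with the single false positive left out of the sum. That reading is an assumption, since the comment does not say how the figure was calculated; the short check below only reproduces the arithmetic.

```python
# Figures quoted in the comment above.
blocked = 2832   # spams blocked successfully
missed = 12      # spams that got through

# Assumed definition: blocked spam as a share of all spam received.
catch_rate = blocked / (blocked + missed)
print(f"{catch_rate:.2%}")  # prints 99.58%
```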

March 18, 2005: 2:34 pm

How naive bayesian classifier can be made ineffective

A discussion on a failure vector of naive bayesian classifiers…

March 18, 2005: 1:03 am

Thanks for the ideas, Glenn. I did retrain once about a year ago. Looks like it’s time again. What thresholds do you use?

Changing my email address is unfortunately not an option because too many people, including my clients, friends etc., have it.

I have been using it for 5-6 years now, maybe more. I still get emails at the old Hotmail address :)

March 18, 2005: 12:48 am

Sounds like you need to retrain your filter or change the threshold. I use SpamBayes to filter out about 1800-2500 spams per day and have been doing so since last spring. So I’m just under your levels. It does suck if you have to restart Outlook all the time, but it’s not so bad once it is up. I used to train it on all the spam that I got, but I’ve gotten a little more particular over time.

Maybe it is time to change email addresses?

March 17, 2005: 5:56 pm

For your amount of spam I would say SpamBayes is good enough. It is free.
I get anywhere between 2500 and 5000 spams every day. It is just not scalable enough to handle this huge load.


mark
March 14, 2005: 8:37 am

For Outlook I use Matador. It cost around $30 but was well worth it. I can tell you that I get about 300 spams over a weekend. Matador catches about 95% of that. I’m still looking for a good one that is free. -Mark
