How a naive Bayesian classifier can be made ineffective

By Angsuman Chakraborty, Gaea News Network
Friday, March 18, 2005

I just received an email that is clearly spam. However, SpamBayes thinks there is only a 13% probability (a score of 0.13) that it is spam.

I have a big training corpus: 623 good (ham) and 3113 spam messages.

In a previous post I mentioned that SpamBayes is not working for me anymore. This message is a good example of why.

Frankly, there isn’t much that SpamBayes, or any naive Bayesian filter, can do about it. Take a look at the message below.

Subject: Re: Spyware

Desktop icons are automatically added to the desktop ?
Suffering from Unexplained home page change ?

It's very likely that they are being served up by spyware software

Try 2005 Highest-Rated Spyware Remover:

Free Download Here: https://[Spam Affiliate Link...edited]

Prevent the installation of hijackers spyware
Prevent the installation of hijackers spyware
Prevent the installation of adware spyware
and other potentially unwanted pests.

Try our online scan now: https://[Spam Affiliate Link...edited]

Q-u^1*t [Spam Affiliate Link...edited]

The message headers are equally uninteresting for SpamBayes. Here is what SpamBayes thinks about it.

Spam Score: 13% (0.130563)

word                                spamprob         #ham  #spam
'*H*'                               0.740598            -      -
'*S*'                               0.001723            -      -
'header:In-Reply-To:1'              0.0879684         164     78
'potentially'                       0.147771            8      6
'page'                              0.175691           91     96
'likely'                            0.195508           18     21
'installation'                      0.197697            8      9
'served'                            0.201793            7      8
'subject:: '                        0.227479          282    414
'software'                          0.241129          112    177
'change'                            0.247864           77    126
'suffering'                         0.252365            4      6
'download'                          0.254508           45     76
'to:addr:angsuman'                  0.262497          411    730
'header:Received:4'                 0.265284           88    158
'added'                             0.288257           29     58
'try'                               0.312818           57    129
'other'                             0.313797          171    390
'prevent'                           0.315326           12     27
'scan'                              0.345157            4     10
'being'                             0.34637            58    153
'very'                              0.360483           95    267
'skip:a 10'                         0.361028          183    516
'now:'                              0.370986           10     29
'that'                              0.375101          345   1034
'they'                              0.375572          101    303
'are'                               0.385233          349   1092
'reply-to:none'                     0.393789          504   1635
'here:'                             0.608336           15    117
'adware'                            0.653949            0      1
'unwanted'                          0.665617            2     21
'2005'                              0.79075             1     22
'spyware'                           0.820111            0      4
'url:discon'                        0.820111            0      4
'url:700'                           0.844931            0      5
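The 13% figure can be reproduced from the table above. What follows is a minimal Python sketch of chi-squared combining, which is, to the best of my knowledge, how SpamBayes merges per-token probabilities by default. The names `chi2q` and `chi_combine` are illustrative and not SpamBayes's API, and the probabilities are simply copied from the clue list. Under those assumptions `*H*` is the combined ham evidence, `*S*` is the combined spam evidence, and the final score is (S - H + 1) / 2.

```python
import math

# Per-token spam probabilities copied from the clue table above.
# The '*H*' and '*S*' rows are outputs of the combining step, not inputs.
CLUES = [
    0.0879684, 0.147771, 0.175691, 0.195508, 0.197697, 0.201793,
    0.227479, 0.241129, 0.247864, 0.252365, 0.254508, 0.262497,
    0.265284, 0.288257, 0.312818, 0.313797, 0.315326, 0.345157,
    0.34637, 0.360483, 0.361028, 0.370986, 0.375101, 0.375572,
    0.385233, 0.393789, 0.608336, 0.653949, 0.665617, 0.79075,
    0.820111, 0.820111, 0.844931,
]


def chi2q(x2, dof):
    """P(chi-squared variable with an even `dof` degrees of freedom >= x2)."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Return (H, S, score) for a list of per-token spam probabilities."""
    n = len(probs)
    # Log of the product of (1 - p) feeds S; log of the product of p feeds H.
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    S = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    H = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    return H, S, (S - H + 1.0) / 2.0


H, S, score = chi_combine(CLUES)
print(f"all 33 clues:      H={H:.4f}  S={S:.4f}  score={score:.4f}")

# For comparison: keep only the seven clues that lean towards spam.
spam_leaning = [p for p in CLUES if p > 0.5]
_, _, spam_only_score = chi_combine(spam_leaning)
print(f"spam-leaning only: score={spam_only_score:.4f}")
```

With all 33 clues the sketch should land close to the H, S and score values SpamBayes reported. The seven spam-leaning clues on their own score well into spam territory; it is the 26 ham-leaning tokens that drag the message down to 13%.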

Handling this spam is very hard for a naive Bayesian classifier. It doesn’t include any of the standard spam keywords, and it doesn’t directly try to sell you anything. The choice of language shows signs of an intelligent spammer: the message includes lots of non-spammy yet contextually relevant words, which lower the score. The only spammy word (quit) has been masked. It even includes ham words in the URL.

To the human eye this is clearly spam. To a computer it is not.

Note: You could assign a very high score to the words spyware or adware by hand, but then spammers can always pollute the word space with misspellings. Also, your friends may want to tell you about AdAware, a legitimate spyware removal tool.
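To see why 'spyware' only reaches 0.82 even though it has never appeared in ham, and why a misspelling helps the spammer, here is a minimal sketch of a per-token probability estimate. It assumes Robinson-style smoothing with what I understand to be SpamBayes's defaults (prior strength 0.45, unknown-word probability 0.5) plus an adjustment that discounts counts from the larger class. Whether that adjustment was enabled in this setup is not stated; it is an assumption, chosen because it reproduces the table closely. `word_prob` and the token `sp1ware` are hypothetical names used only for illustration.

```python
def word_prob(ham_count, spam_count, n_ham, n_spam,
              strength=0.45, unknown_prob=0.5, adjust_imbalance=True):
    """Smoothed spam probability of one token (sketch, not SpamBayes's API)."""
    if ham_count == 0 and spam_count == 0:
        # Never-seen token, e.g. a fresh misspelling: no evidence either way.
        return unknown_prob
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    if adjust_imbalance:
        # Assumed adjustment: discount sightings from the over-represented class.
        spam2ham = min(n_spam / n_ham, 1.0)
        ham2spam = min(n_ham / n_spam, 1.0)
    else:
        spam2ham = ham2spam = 1.0
    n = ham_count * spam2ham + spam_count * ham2spam
    # Blend the raw estimate with the prior, weighted by how much data we have.
    return (strength * unknown_prob + n * raw) / (strength + n)


N_HAM, N_SPAM = 623, 3113  # corpus sizes quoted earlier in this post

for token, ham, spam in [("spyware", 0, 4),
                         ("potentially", 8, 6),
                         ("sp1ware", 0, 0)]:  # hypothetical misspelling
    print(token, round(word_prob(ham, spam, N_HAM, N_SPAM), 4))
```

Under these assumptions the first two values come out within about 0.001 of the table (the corpus may have shifted by a message or so between scoring and counting), and the unseen misspelling sits at a neutral 0.5. With only four spam sightings, and those discounted because spam outnumbers ham five to one, 'spyware' cannot move far from the prior. Every new misspelling starts again from zero sightings.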

We need a layered spam-removal approach at the source to handle this type of spammer.

Filed under: Spam Watch, Technology, Web

Discussion
March 11, 2010: 3:11 pm

Thanks for sharing this information! It’s very useful for a lot of people trying to understand how we can use this product.

March 6, 2010: 1:35 pm

Nice article. I was really impressed while reading your post. Thank you so much; it will be useful to everyone.

March 4, 2010: 3:48 am

That’s really a fantastic post! Added to my favourite blogs list. I have been reading your blog for the last couple of weeks and enjoy every bit. Thanks.

August 21, 2009: 9:08 am

What would be the solution?

August 21, 2009: 9:07 am

Then what’s the solution to overcome this ineffectiveness?

May 4, 2005: 1:17 am

[...] hru a link, chances are the URL of your originating site contains some of these keywords. Spammers are getting smarter. It’s a neve [...]

March 20, 2005: 11:51 am

Bayesian filters are only conditionally effective

Simple Thoughts presents a spam example that Bayesian filters struggle against in vain. I have never thought much of this kind of filtering, above all because you first have to receive the entire e-mail. Much better is…
