How a naive Bayesian classifier can be made ineffective
By Angsuman Chakraborty, Gaea News Network
Friday, March 18, 2005
I just received an email which is clearly spam. However, SpamBayes thinks there is only a 13% probability (a score of 0.13) that it is spam.
I have a big corpus - 623 good and 3113 spam messages.
In a previous post I discussed how SpamBayes is no longer working for me. This message is a good example of that.
Frankly, there isn’t much SpamBayes or any other naive Bayesian filter can do about it. Take a look at the message below.
Subject: Re: Spyware Desktop icons are automatically added to the desktop ? Suffering from Unexplained home page change ? It's very likely that they are being served up by spyware software Try 2005 Highest-Rated Spyware Remover: Free Download Here: https://[Spam Affiliate Link...edited] Prevent the installation of hijackers spyware Prevent the installation of hijackers spyware Prevent the installation of adware spyware and other potentially unwanted pests. Try our online scan now: https://[Spam Affiliate Link...edited] Q-u^1*t [Spam Affiliate Link...edited]
The message headers are equally uninteresting for SpamBayes. Here is what SpamBayes thinks about it.
Spam Score: 13% (0.130563)

word                    spamprob    #ham   #spam
'*H*'                   0.740598    -      -
'*S*'                   0.001723    -      -
'header:In-Reply-To:1'  0.0879684   164    78
'potentially'           0.147771    8      6
'page'                  0.175691    91     96
'likely'                0.195508    18     21
'installation'          0.197697    8      9
'served'                0.201793    7      8
'subject:: '            0.227479    282    414
'software'              0.241129    112    177
'change'                0.247864    77     126
'suffering'             0.252365    4      6
'download'              0.254508    45     76
'to:addr:angsuman'      0.262497    411    730
'header:Received:4'     0.265284    88     158
'added'                 0.288257    29     58
'try'                   0.312818    57     129
'other'                 0.313797    171    390
'prevent'               0.315326    12     27
'scan'                  0.345157    4      10
'being'                 0.34637     58     153
'very'                  0.360483    95     267
'skip:a 10'             0.361028    183    516
'now:'                  0.370986    10     29
'that'                  0.375101    345    1034
'they'                  0.375572    101    303
'are'                   0.385233    349    1092
'reply-to:none'         0.393789    504    1635
'here:'                 0.608336    15     117
'adware'                0.653949    0      1
'unwanted'              0.665617    2      21
'2005'                  0.79075     1      22
'spyware'               0.820111    0      4
'url:discon'            0.820111    0      4
'url:700'               0.844931    0      5
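To make the listing easier to read, here is a rough sketch of how the 33 per-token probabilities could be combined into the final score. It assumes the chi-squared combining scheme that, as far as I know, SpamBayes uses by default; the post itself does not say which scheme was active. The function names `chi2q` and `combine` are made up for this illustration and are not the SpamBayes API.

```python
import math

# Clue probabilities copied from the listing above. The '*H*' and '*S*' rows
# are results of the combination, not clues, so they are left out.
CLUES = [
    0.0879684, 0.147771, 0.175691, 0.195508, 0.197697, 0.201793,
    0.227479, 0.241129, 0.247864, 0.252365, 0.254508, 0.262497,
    0.265284, 0.288257, 0.312818, 0.313797, 0.315326, 0.345157,
    0.34637, 0.360483, 0.361028, 0.370986, 0.375101, 0.375572,
    0.385233, 0.393789, 0.608336, 0.653949, 0.665617, 0.79075,
    0.820111, 0.820111, 0.844931,
]


def chi2q(x2, df):
    """Upper-tail probability of a chi-squared variable (df must be even)."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Assumed chi-squared combining: returns (H, S, score)."""
    n = len(probs)
    # S grows when many clues are close to 1 (spam evidence).
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H grows when many clues are close to 0 (ham evidence).
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return h, s, (s - h + 1.0) / 2.0


h, s, score = combine(CLUES)
print(f"H={h:.4f}  S={s:.4f}  score={score:.4f}")
```

Under these assumptions the result lands at roughly 0.13, in line with the reported score: the 26 mildly hammy tokens outweigh the 7 spammy ones, so H is high, S is close to zero, and the score is (S - H + 1) / 2.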
Handling this spam is very hard for a naive Bayesian classifier (NBC). It doesn’t include any of the standard keywords. It doesn’t directly try to sell you anything. The choice of language shows signs of an intelligent spammer. It includes lots of non-spammy yet contextually relevant words, which lower the score. The only spammy word (quit) has been masked. It even includes ham words in the URL.
To the human eye this is clearly spam. To a computer it is not.
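It may seem odd that a word like 'page' counts as ham evidence when it was seen in 96 spam and only 91 ham messages. The sketch below shows one way the per-token probability could be derived from those counts. It makes several assumptions the post does not confirm: a Robinson-style smoothed estimate, a prior strength of 0.45 and prior probability of 0.5 (the defaults as I recall them), and an adjustment for the lopsided corpus. That last one is an inference from the numbers in the listing. `word_spamprob` is a hypothetical helper, not a SpamBayes function.

```python
# Corpus sizes quoted in the post.
N_HAM, N_SPAM = 623, 3113

# Assumed smoothing constants (recalled defaults, not stated in the post).
UNKNOWN_WORD_STRENGTH = 0.45
UNKNOWN_WORD_PROB = 0.5


def word_spamprob(ham_count, spam_count, n_ham=N_HAM, n_spam=N_SPAM):
    """Estimate a token's spam probability; needs at least one nonzero count."""
    # Compare rates, not raw counts: 96 of 3113 spam is a far lower rate
    # than 91 of 623 ham.
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)

    # Assumed imbalance adjustment: evidence from the larger corpus is
    # discounted by the ratio of corpus sizes.
    spam2ham = min(n_spam / n_ham, 1.0)
    ham2spam = min(n_ham / n_spam, 1.0)
    n = ham_count * spam2ham + spam_count * ham2spam

    # Pull the raw estimate towards the prior when there is little evidence.
    s, x = UNKNOWN_WORD_STRENGTH, UNKNOWN_WORD_PROB
    return (s * x + n * raw) / (s + n)


for token, ham, spam in [("page", 91, 96), ("here:", 15, 117), ("spyware", 0, 4)]:
    print(f"{token:10s} {word_spamprob(ham, spam):.4f}")
```

With these assumptions the values come out close to the listing (about 0.18, 0.61 and 0.82), though not identical in the last digits, so the real training counts or options probably differed slightly. The point is that with five times more spam than ham in the corpus, an everyday word has to be heavily over-represented in spam before it stops looking hammy.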
Note: You could assign a very high score to the words spyware or adware, but then spammers can always pollute the word space with misspellings. Also, your friends may want to tell you about AdAware, a legitimate spyware removal tool.
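A toy illustration of both problems in the note, assuming a simple whitespace tokenizer and a hand-written override table. The `OVERRIDES` table and `token_scores` function are hypothetical; SpamBayes tokenizes far more carefully than this.

```python
# Hypothetical hand-assigned scores for the two words from the note.
OVERRIDES = {"spyware": 0.99, "adware": 0.99}
UNKNOWN = 0.5  # a token never seen before carries no evidence either way


def token_scores(text):
    """Map each lower-cased, whitespace-separated token to its score."""
    return {tok: OVERRIDES.get(tok, UNKNOWN) for tok in text.lower().split()}


# The override fires on the exact spelling only.
print(token_scores("remove spyware and adware"))

# A misspelling is a brand-new token, so the override never fires.
print(token_scores("remove spywarre and adwarre"))

# A legitimate mail from a friend trips the same override.
print(token_scores("AdAware is a good spyware removal tool"))
```

A token-based filter treats 'spyware' and 'spywarre' as unrelated strings, so each new misspelling starts from scratch, while the honest recommendation is penalized for using the correct word.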
We need a layered spam-removal approach at the source to handle this type of spammer.
March 11, 2010: 3:11 pm
Thanks for sharing this information! It’s very useful for a lot of people trying to understand how we can use this product.
March 6, 2010: 1:35 pm
Nice article. I was really impressed while reading your post. Thank you so much, it will be useful to everyone.
March 4, 2010: 3:48 am
That’s really a fantastic post! Added to my favourite blogs list. I have been reading your blog for the last couple of weeks and enjoy every bit. Thanks.
November 9, 2009: 2:46 am
I have been reading your blog for the last couple of weeks and enjoy every bit. Thanks.
May 4, 2005: 1:17 am
[...] hru a link, chances are the URL of your originating site contains some of these keywords. Spammers are getting smarter. It’s a neve [...]
March 20, 2005: 11:51 am
Bayesian filters are only partly effective. Simple Thoughts presents a spam example that Bayesian filters cannot get their teeth into. I have never thought much of this kind of filtering, above all because you first have to receive the entire e-mail. Much better is…