To Bayes or not to Bayes
Let's start with a simple question. How much spam do you get?
Now split up the real email you get into 3 sections.
1. Real Mail - sent by people, to you, specifically by intent
2. List Mail - sent by people, to a list, of which you are a member
3. Feedback / notification Mail - sent by automated processes, like a notifier on a website or a system-status notification.
OK, I get between 7,500 and 8,200 emails a day (actual numbers, taken over the last 4 weeks), and 6,500 - 6,800 of those are spam. This does not include email that I get as a part of my day job. Emails of type 2 and 3 hardly get any spam; it's the Type 1 email that gets most of it. And it's really not nice to have to hang around waiting for the client app to clean your email every time it starts up - and what about when you check email over a mobile device of some sort?
Last year, I decided it was no longer viable to run spam filtering client side, and moved over to a server-side setup. One of the key ingredients thrown into the mix: Bayes. It seemed at the outset that I had to have a good, well-tuned Bayes process. A few friends joined in and had their inbound mail queues dropped off on my mail scanner to be cleaned and processed. The results were good to start off with: about 95% of spam trapped, with maybe 1% false positives at the most. Things got better with time. Then Bayes poisoning started to kick in. By Oct 2004 the volume of email being handled was large enough that I was rebuilding the Bayes db 4 times a day. Just between us friends, the mail system was handling to the tune of 15,000 - 16,000 emails a day.
The reason to go for server-side spam and virus filtering was that on the client side you want instant email. Viruses can be left behind, I don't even want to know about them - and spam, once marked, should be filed away; I don't want to look at it unless I have a reason to. However, the system ended up introducing lag at the server: it was sometimes taking 4 - 5 minutes to clear a queue while it waited for a Bayes db rebuild. As an experiment, I set up dspam and crm114 on parallel machines.
All inbound email was split three ways: the main system received, processed and passed on the main queue, while dspam and crm114 each got a copy of the 'raw' message, processed it and sent it to /dev/null. The aim was to see what kind of stats the dspam and crm114 machines were able to achieve. And while they both made the 97 - 98% cut, just like the main production machine, they kept running with 1.1 to 1.3% false positives. That's way more than any rational person would accept.
Then came the question of what would happen if Bayes was shut down on the production machine and it moved to rules and specific network tests only. The result? 0 false positives and > 97.4% accuracy at spam trapping. The dspam and crm114 machines ran for 3 weeks. In all, had they been handling live mail, > 1.3% would have been lost to false positives. Which would you choose?
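To put those percentages into message counts, here is a rough Python sketch using the personal daily volumes quoted earlier. It rests on an assumption the post does not confirm: that the 1.1 - 1.3% false positive rate is measured against legitimate mail only. The function names are made up for illustration.

```python
# Rough arithmetic only. ASSUMPTION: the false positive rate is a fraction
# of legitimate mail (the post does not say what it is measured against).

def legit_range(total_range, spam_range):
    """Bounds on legitimate mail per day from (low, high) total and spam counts."""
    low = total_range[0] - spam_range[1]
    high = total_range[1] - spam_range[0]
    return low, high

def expected_false_positives(legit_per_day, fp_rate, days=1):
    """Expected number of legitimate messages wrongly flagged as spam."""
    return legit_per_day * fp_rate * days

TOTAL_PER_DAY = (7500, 8200)   # daily volume quoted in the post
SPAM_PER_DAY = (6500, 6800)    # daily spam quoted in the post
FP_RATES = (0.011, 0.013)      # 1.1% to 1.3%

legit_low, legit_high = legit_range(TOTAL_PER_DAY, SPAM_PER_DAY)
best_case = expected_false_positives(legit_low, FP_RATES[0])
worst_case = expected_false_positives(legit_high, FP_RATES[1])
print(f"legitimate mail per day: {legit_low} - {legit_high}")
print(f"misfiled per day: {best_case:.1f} - {worst_case:.1f}")
```

Under that assumption, roughly 8 to 22 real messages a day would land in the spam folder, each of which has to be dug out by hand.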
Now my question is: given a set of well-defined rules, network tests and some regex, why use Bayes at all?
You can get more info on dspam here, and the crm114 website is here. Both of them are excellent products, if they work for your role - and they do for a very large number of people.
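For readers who have not seen a rules-only filter, the idea can be sketched in a few lines of Python. This is not the ruleset from the production machine: the rule names, patterns, scores and threshold below are all invented for illustration, and the network tests are left out because they need live lookups.

```python
import re

# HYPOTHETICAL rules: (name, compiled pattern, score). None of these come
# from the post; a real ruleset would be far larger and carefully tuned.
RULES = [
    ("PHARMA_WORDS", re.compile(r"\b(viagra|cialis)\b", re.I), 3.0),
    ("MONEY_CLAIM", re.compile(r"\$\d[\d,]*\s*(million|guaranteed)", re.I), 2.0),
    ("SUBJECT_ALL_CAPS", re.compile(r"^Subject: [^a-z\n]+$", re.M), 1.5),
]

THRESHOLD = 3.0  # hypothetical cut-off: at or above this, mark as spam

def score(message):
    """Return (total score, names of the rules that matched)."""
    hits = [(name, pts) for name, pattern, pts in RULES if pattern.search(message)]
    total = sum(pts for _, pts in hits)
    return total, [name for name, _ in hits]

def is_spam(message):
    """Mark a message as spam when its rule score reaches the threshold."""
    total, _ = score(message)
    return total >= THRESHOLD

spam = "Subject: BUY NOW\n\nCheap viagra, $5 million guaranteed"
ham = "Subject: lunch on friday\n\nSee you at noon."
print(score(spam), is_spam(spam))
print(score(ham), is_spam(ham))
```

Every decision here traces back to a named rule, so a false positive can be fixed by adjusting that one rule rather than retraining a database.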
--
Karanbir Singh { http://www.karan.org/ }
[ Music : Agnelli & Nelson Feat. The Burn - Shivver ]
3 comments
I don't think you should write off "bayesian" filters because of your experience with dspam and crm-114.
See http://plg.uwaterloo.ca/~gvcormac/spamcormack.html
The fact that it creates false positives pretty much means that it's going to waste my time (since I am forced to look at and scan emails for these FPs), and it's wasting system resources (Bayes is by far the most expensive part of any spam trap that runs with it).