spam and training spam

Posted by Karanbir Singh on 29/Jul/2009  ~  Posted in: Linux, Internet, Software, EMail, Spam

I totally subscribe to the idea of email being a fantastic communication tool, way better than anything the web has put out as yet, and because it's async and works with priority tagging it's automatically better than any form of synchronous communication. Which brings me to the spam-tracking setup I have in place. I prefer to see the few spam emails and therefore have a higher required-score to mark an email as spam, rather than have a lower score and have the odd non-spam email get marked as spam and therefore lost in the spam noise [1]. If people send me email, there needs to be a high level of assurance that it's going to come through to me! And if that comes with the fact that I need to see some spam in my inbox, so be it.

So, having used spamassassin for years and years, I finally decided that it would be quite nice to have a loop to send missed spam back to the server and have spamassassin 'learn' about it. I know, it sounds great - just something that I never bothered setting up. Ever. And so in Jan 2009, I set up the feedback loop [2]. To keep things interesting, I also set up a parallel delivery line. So every email received at the MTA was duplicated and delivered in parallel to two destinations. The first destination is the one I use for my email; it gets the benefit of this feedback to learn about spam / ham, as well as an autowhitelist process. The other delivery destination used the exact same spamassassin ruleset, but did *not* get the benefits of the spam-learn and ham-learn process, and did not use the autowhitelist mechanism. The aim was to measure how much of an impact this 'spam training' has on the overall result. One more thing that I did was to split all high-score ( > 14 ) spam away from the Spam folder into the HighScore folder.
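The folder split described above can be sketched in a few lines of Python. This is only an illustration, not the delivery setup itself: the function name `route_folder` is made up, the thresholds (7 and 14) are the ones quoted in this post, and how a score of exactly 7 or exactly 14 is treated is an assumption.

```python
# Thresholds quoted in the post: Spam is "greater than 7", HighScore is "> 14".
# What happens at exactly 7 or exactly 14 is an assumption of this sketch.
SPAM_THRESHOLD = 7.0
HIGH_SCORE_THRESHOLD = 14.0


def route_folder(score):
    """Return the folder a message with this spam score would land in."""
    if score > HIGH_SCORE_THRESHOLD:
        return "HighScore"
    if score > SPAM_THRESHOLD:
        return "Spam"
    return "Inbox"
```

Both delivery destinations would use the same routing; the only difference between them is whether the score itself was influenced by the learning and the autowhitelist.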

Last week, I wrote a small script that compared the results. It's amazing. And rather than talk first, here are the numbers:

Month    : matched : spam in inbox : high-score
Jan 2009 : 94%     : 7.1%          : 38%
Feb 2009 : 98%     : 5.8%          : 42%
Mar 2009 : 100%    : 6.0%          : 37%
Apr 2009 : 100%    : 3.6%          : 33%
May 2009 : 100%    : 3.1%          : 28%
Jun 2009 : 100%    : 3.1%          : 24%

How to read these numbers:

  • First number shows the percentage of my inbox I was able to map back to the duplicated delivery destination. The reason it's lower than 100% for some months is that I deleted some emails, without realising that the dupe delivery was running. Emails deleted were mostly emails with large attachments or things that I really don't care about, and in some cases spam.
  • Second number shows what percentage of spam made it into the inbox as compared with the overall number of spams ( things I tagged as spam manually + emails in the Spam folder + HighScore folder )
  • Third number shows the percentage of spam marked as high-scoring, emails that get more than 14 in the spam score
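As a rough illustration of how these three numbers could be computed from per-month folder counts, here is a minimal Python sketch. It is not the script that produced the figures in this post. The names `MonthCounts`, `pct` and `monthly_numbers` are hypothetical, and it assumes the third number uses the same denominator as the second (all spam for the month).

```python
from dataclasses import dataclass


@dataclass
class MonthCounts:
    inbox_total: int        # messages in the main inbox
    inbox_matched: int      # of those, also found in the duplicate delivery
    inbox_spam: int         # spam that reached the inbox and was tagged by hand
    spam_folder: int        # messages in the Spam folder
    high_score_folder: int  # messages in the HighScore folder


def pct(part, whole):
    """Percentage of part in whole, or 0.0 when whole is empty."""
    return 100.0 * part / whole if whole else 0.0


def monthly_numbers(c):
    """Return (matched %, spam-in-inbox %, high-score %) for one month."""
    # Overall spam = manually tagged + Spam folder + HighScore folder.
    total_spam = c.inbox_spam + c.spam_folder + c.high_score_folder
    return (
        pct(c.inbox_matched, c.inbox_total),
        pct(c.inbox_spam, total_spam),
        pct(c.high_score_folder, total_spam),
    )
```

With made-up counts (not the real data), `monthly_numbers(MonthCounts(200, 188, 10, 50, 40))` gives `(94.0, 10.0, 40.0)`.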

So the overall percentage of spam still making its way into my inbox has halved. Which is great. But... the amount of spam making the high-score grade has also fallen, by leaps and bounds. And the thing is - I tend to sometimes skim through the Spam folder to see if it hit any false positives. Everything in HighScore is really spam, so ignoring that is fine. However, it used to take me 3 - 5 minutes to skim the Spam folder, and I've now given up on that process completely. There is way too much traffic in there. So what's going on? spamassassin clearly knows more about my inbox, which in turn leads to less spam in there, but it's also not scoring stuff as high as it used to. Now see how it works for the alternative delivery point, which didn't get this spam-learning:

Month    : matched : spam in inbox : high-score
Jan 2009 : 94%     : 7.1%          : 38%
Feb 2009 : 98%     : 6.9%          : 37%
Mar 2009 : 100%    : 11.8%         : 37%
Apr 2009 : 100%    : 8.5%          : 41%
May 2009 : 100%    : 7.1%          : 42%
Jun 2009 : 100%    : 8.1%          : 39%

Taking May as an example ( since I had a 100% match ), I looked at where the emails marked as high score in the second ( untrained ) destination landed in my main mailbox.

HighScore -> HighScore : 61%
HighScore -> Spam : 35%
HighScore -> Inbox : 4%

So, 61% of the emails in HighScore were also in HighScore in the main mailbox; however, 35% of the emails that spamassassin thought were HighScore without the learning ended up in the Spam folder ( which gets scores greater than 7 but less than 14 ). And even more amazingly, 4% of all emails in the HighScore folder without the learning end up in the Inbox after the learning. Which in turn means that out of the 3.1% of all spam making it into the inbox in May 2009, 1.7% ( or over half! ) were actually emails that spamassassin marked as HighScore without the learning.
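This kind of "where did it land" breakdown can be sketched as a small cross-tabulation. The Python below is a hedged example, not the actual comparison script: it assumes the two mailboxes are matched on a per-message key such as the Message-ID header ( the post does not say how the matching was done ), and the function name `high_score_destinations` is made up.

```python
from collections import Counter


def high_score_destinations(untrained, trained):
    """Percentage breakdown of where untrained-HighScore mail landed.

    Both arguments map a message key (assumed here to be the Message-ID)
    to a folder name: "Inbox", "Spam" or "HighScore".
    """
    landed = Counter(
        trained[key]
        for key, folder in untrained.items()
        if folder == "HighScore" and key in trained  # skip unmatched mail
    )
    total = sum(landed.values())
    if total == 0:
        return {}
    return {folder: 100.0 * n / total for folder, n in landed.items()}
```

Messages that only exist on one side ( for example, ones deleted from the main mailbox ) are skipped, which is why a month with a 100% match is the cleanest one to look at.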

hummm.

- KB

[1] : I get in excess of 3k emails a day on average; 80% of that is spam and a large portion of the rest is mailing lists, but even then a large number are real proper emails from people and things / places I've signed up for email from.

[2] : I use cyrus-imapd for all my emails, so the loop was essentially just folders that other processes could look at and handle according to a predefined set of rules. E.g., ensure that the emails in learn-as-spam got passed into sa-learn with --spam etc.
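A minimal sketch of such a loop, in Python, could look like the following. It is an assumption-laden illustration rather than the setup described above: the function names are made up, the paths passed in are placeholders, and it assumes the messages in each folder are readable as plain files in a directory that sa-learn can be pointed at. The `--spam` and `--ham` options are real sa-learn options; everything else is hypothetical.

```python
import subprocess
from pathlib import Path


def sa_learn_command(folder, kind):
    """Build the sa-learn command line for one feedback folder."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    # sa-learn takes --spam or --ham followed by files or directories.
    return ["sa-learn", "--" + kind, str(folder)]


def run_feedback(folders, dry_run=True):
    """folders maps 'spam' / 'ham' to the directory holding those messages.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = [sa_learn_command(path, kind) for kind, path in folders.items()]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Something like `run_feedback({"spam": Path("/path/to/learn-as-spam")}, dry_run=False)` run from cron would then feed the missed spam back; the path here is a placeholder, not a real location.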