Craig Box's journeys, stories and notes...



SpamAssassin 3.2.0 backport for Ubuntu Dapper

Wednesday, June 6th, 2007

I've built packages for SpamAssassin 3.2.0 for Ubuntu Dapper. They are available in my firewall repository with the dependencies (libnet-dns-perl, libnetaddr-ip-perl, libmail-spf-perl):

deb http://ubuntu.hs.net.nz dapper firewall

Beware: if you use this repository, you will also get a newer version of ClamAV and some other packages.

It was a bit of a mission to build, but the Prevu tool made it easier. Prevu is like pbuilder for backports, and anyone doing anything with backports should use it. The 0.4.1 release on SourceForge works on Dapper.

Graphing and analysing SpamAssassin

Friday, July 21st, 2006

Here's something simple that I never thought of - props to my workmate Tom for coming up with this.

SpamAssassin scores plot

This is a gnuplot graph of our SpamAssassin scores. The code used to generate it is at the bottom of the SpamAssassin notes page at the WLUG wiki.
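If you'd rather check the shape of the distribution before plotting anything, here is a minimal Python sketch of the binning step. It is not the gnuplot code from the wiki. It assumes your spamd log lines carry the score as "(score/threshold)", as in "identified spam (18.1/5.0)"; the pattern, the function names and the bin width are all my own choices, so adjust them to your logs.

```python
import re
from collections import Counter

# Assumption: each scored log line contains "(score/threshold)",
# e.g. "spamd: identified spam (18.1/5.0) for ...". Adjust to your log format.
SCORE_RE = re.compile(r"\((-?\d+(?:\.\d+)?)/\d+(?:\.\d+)?\)")


def extract_scores(lines):
    """Return the message score from every line that contains one."""
    scores = []
    for line in lines:
        match = SCORE_RE.search(line)
        if match:
            scores.append(float(match.group(1)))
    return scores


def histogram(scores, bin_width=1.0):
    """Count scores per bin; each key is the lower edge of its bin."""
    counts = Counter()
    for score in scores:
        lower_edge = (score // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "spamd: identified spam (18.1/5.0) for nobody:65534 in 2.5 seconds",
        "spamd: clean message (-2.6/5.0) for nobody:65534 in 0.4 seconds",
        "spamd: clean message (-100.3/5.0) for nobody:65534 in 0.3 seconds",
    ]
    for edge, count in histogram(extract_scores(sample), 5.0).items():
        print(f"{edge:8.1f} {'#' * count}")
```

The bin edges and counts can be written out as two columns and handed to gnuplot, or just eyeballed as the text bars printed here.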

The grouping around -100 is caused by the whitelist rule, which scores messages down 100 points (ensuring they are never marked as spam). Usefully, this rule doesn't count towards the threshold needed to be reached before a message is learnt as ham by the Bayesian categoriser.

We seem to have a reasonably normal distribution of good mail, between about -5 and +5, and a reasonably normal distribution of spam, between 10 and 60. The two groups are clearly separated, which suggests our filter is working really well. What I took from this is that it was safe to raise the ham learning threshold: it defaults to 0.1, but I've set ours to 1, as we have a lot of rules that score all messages up fairly equally.
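To make the interaction between the whitelist rule and the learning thresholds concrete, here is a simplified Python sketch. It only mirrors what is described above: some rules do not count towards the learning score, and that score is compared against a ham and a spam threshold. SpamAssassin's real auto-learn logic applies further conditions, and the function names, rule scores, the spam threshold of 12 and the exact boundary comparisons are assumptions for illustration. In SpamAssassin itself, the ham setting is `bayes_auto_learn_threshold_nonspam`.

```python
def learning_score(hits):
    """Sum the scores of the rules that count towards auto-learning.

    Each hit is (rule_name, score, counts_for_learning).
    """
    return sum(score for _name, score, counts in hits if counts)


def autolearn_decision(hits, ham_threshold=1.0, spam_threshold=12.0):
    """Return 'ham', 'spam' or 'none' for one message.

    Simplified: the thresholds and comparisons are illustrative assumptions.
    """
    score = learning_score(hits)
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "none"


if __name__ == "__main__":
    # Hypothetical scores; the whitelist hit is excluded from learning.
    whitelisted = [
        ("USER_IN_WHITELIST", -100.0, False),
        ("HTML_MESSAGE", 0.5, True),
        ("FORGED_RCVD_HELO", 1.0, True),
    ]
    print(learning_score(whitelisted))      # whitelist score is ignored
    print(autolearn_decision(whitelisted))  # not learnt as ham at threshold 1
```

The point of the example is that a whitelisted message can have a total score near -100 and still not be learnt as ham, because only the remaining rules feed the learning score.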

Also useful is sa-stats.pl, which generates a summary table of how often rules were hit on messages that were either marked as ham or spam. As of today:

TOP SPAM RULES FIRED
---------------------------------------------------------------
RANK RULE NAME                 COUNT  %OFMAIL  %OFSPAM   %OFHAM
---------------------------------------------------------------
   1 RAZOR2_CHECK                153    38.65    76.50     1.00
   2 BAYES_99                    150    37.41    75.00     0.00
   3 RAZOR2_CF_RANGE_51_100      149    37.41    74.50     0.50
   4 RAZOR2_CF_RANGE_E8_51_100   128    31.92    64.00     0.00
   5 URIBL_JP_SURBL              125    31.17    62.50     0.00
   6 URIBL_BLACK                 120    29.93    60.00     0.00
   7 URIBL_SC_SURBL              105    26.18    52.50     0.00
   8 URIBL_OB_SURBL              105    26.18    52.50     0.00
   9 HOST_EQ_D_D_D_D             102    28.93    51.00     6.97
  10 RCVD_IN_SORBS_DUL            92    23.19    46.00     0.50

TOP HAM RULES FIRED
---------------------------------------------------------------
RANK RULE NAME                 COUNT  %OFMAIL  %OFSPAM   %OFHAM
---------------------------------------------------------------
   1 AWL                         193    57.86    19.50    96.02
   2 BAYES_00                    183    45.64     0.00    91.04
   3 RELAY_IS_203                 78    20.20     1.50    38.81
   4 FH_RELAY_NODNS               75    25.44    13.50    37.31
   5 HTML_MESSAGE                 72    35.66    35.50    35.82
   6 UPPERCASE_25_50              60    14.96     0.00    29.85
   7 FORGED_RCVD_HELO             56    36.16    44.50    27.86
   8 USER_IN_WHITELIST            23     5.74     0.00    11.44
   9 NO_REAL_NAME                 20    13.22    16.50     9.95
  10 SPF_HELO_PASS                19     5.49     1.50     9.45

I toyed with changing the scores on rules that fire often on both ham and spam, such as FORGED_RCVD_HELO, but at the moment they contribute only very small weightings overall.
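Picking out those two-sided rules by eye gets tedious, so here is a rough Python sketch that reads rows in the six-column layout shown above and lists the rules that hit a sizeable share of both ham and spam. The function names and the 20% cutoff are arbitrary choices of mine, not anything sa-stats.pl provides.

```python
def parse_rows(text):
    """Parse rows of: rank, rule name, count, %ofmail, %ofspam, %ofham."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly six fields and start with a numeric rank;
        # titles, header rows and separator lines are skipped.
        if len(fields) == 6 and fields[0].isdigit():
            rows.append({
                "rule": fields[1],
                "count": int(fields[2]),
                "pct_spam": float(fields[4]),
                "pct_ham": float(fields[5]),
            })
    return rows


def ambiguous_rules(rows, min_pct=20.0):
    """Rules hitting at least min_pct of spam and at least min_pct of ham."""
    return [
        row["rule"]
        for row in rows
        if row["pct_spam"] >= min_pct and row["pct_ham"] >= min_pct
    ]


if __name__ == "__main__":
    # A few rows copied from the tables above.
    sample = """\
RANK RULE NAME                COUNT  %OFMAIL %OFSPAM  %OFHAM
   1 RAZOR2_CHECK               153  38.65  76.50   1.00
   2 BAYES_00                   183  45.64   0.00  91.04
   5 HTML_MESSAGE                72  35.66  35.50  35.82
   7 FORGED_RCVD_HELO            56  36.16  44.50  27.86
"""
    print(ambiguous_rules(parse_rows(sample)))
```

On the sample rows this flags HTML_MESSAGE and FORGED_RCVD_HELO, which matches what stands out when reading the table.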