
If you run SpamAssassin then you will know that one of the most difficult things to do is train it, and that without training it is at best only fair at removing spam. You can turn on auto-learning, which works to some extent, but my experience is that it works for a while and then drifts until it no longer classifies much of the spam correctly. The real problem is that it is time-consuming to run sa-learn (the training program) by hand. Ideally it would be trivially easy for a mail client to tell the IMAP server to tell SpamAssassin that a particular piece of mail was junk. After all, most mail clients have a "mark this message as junk" button. Sadly it's not that easy. SpamAssassin runs before the mail is given to the IMAP server, and IMAP servers aren't generally clever enough to receive callbacks telling them to run an arbitrary program on the server. It's actually debatable whether they ever should.

Since the ideal solution isn't possible, one has to look for another. The best around at the minute, AFAIK, is to create two special folders where you put email that you want SpamAssassin to learn from. I have a spam folder in my account, which is where emails that SpamAssassin thinks are spam get delivered. Under that I have created two more folders: the first is called train_ham, the second train_spam. Mail that is marked as spam but is actually ham (a false positive) is copied into the train_ham folder and the original dealt with as it should have been. Mail that is spam but not marked as such (a false negative) is moved to the train_spam folder. Why the big thing about moving and copying? Simple: the following script deletes the contents of the training folders when it runs. I don't care that it deletes spam, but I would rather it didn't delete my ham.
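
If your mail client won't create the two training folders for you, they can be made by hand on the server. This is only a rough sketch, and it assumes a Maildir++ layout (as used by Courier and Dovecot), where a subfolder of spam is stored as a dot-separated directory name. The function name make_training_folders and the Maildir path in the usage line are made up for illustration; your server may lay things out differently.

```shell
# make_training_folders MAILDIR
# Creates spam/train_ham and spam/train_spam under the given Maildir.
# Assumption: Maildir++ naming, so "spam/train_ham" lives in ".spam.train_ham".
make_training_folders() {
    for folder in .spam.train_ham .spam.train_spam
    do
        # Every Maildir folder needs cur, new and tmp subdirectories.
        for sub in cur new tmp
        do
            mkdir -p "$1/$folder/$sub"
        done
    done
}
```

Call it as, say, make_training_folders "$HOME/Maildir". You will probably still need to subscribe to the new folders in your mail client before they show up.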

The script below reads two files that tell it which folders on the system contain training ham and spam. Each line contains a separate folder name. It gathers up all the spam and ham and places it in two working directories before calling sa-learn on them. This script is inspired by the teach-sa script by Jean-Marc Liotier, but I have brought it up to date: the latest sa-learn can learn from a directory of files. The script has no additional dependencies.
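
To make the format of the two list files concrete, here is one way they could be generated. It is a sketch only: the function name write_folder_lists is hypothetical, and it assumes every user keeps the training folders as .spam.train_spam and .spam.train_ham inside their Maildir, which may not match your setup.

```shell
# write_folder_lists WORKDIR MAILDIR...
# Writes CertainSpamFolderList and CertainHamFolderList into WORKDIR,
# one training folder per line, for each Maildir given.
write_folder_lists() {
    workdir=$1
    shift
    # Start both lists empty.
    : > "$workdir/CertainSpamFolderList"
    : > "$workdir/CertainHamFolderList"
    for maildir in "$@"
    do
        # Assumed folder names; adjust to your own layout.
        echo "$maildir/.spam.train_spam" >> "$workdir/CertainSpamFolderList"
        echo "$maildir/.spam.train_ham" >> "$workdir/CertainHamFolderList"
    done
}
```

For example, write_folder_lists /root/train-sa /home/alice/Maildir /home/bob/Maildir (alice and bob being made-up users) leaves CertainSpamFolderList holding two lines, /home/alice/Maildir/.spam.train_spam and /home/bob/Maildir/.spam.train_spam. You can just as easily write the files in a text editor.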

#!/bin/sh
# Complete path to the directory where train-sa.sh and the
# configuration files can be found.
workdir="/root/train-sa"
# Directory to hold training spam in.
spamdir="$workdir/spam"
# Directory to hold training ham in.
hamdir="$workdir/ham"
# Remove the old directories if they exist
rm -rf "$spamdir"
rm -rf "$hamdir"
# Make the training directories
mkdir "$spamdir"
mkdir "$hamdir"
# Move the spam from all the users' spam folders into the spam learning folder.
# CertainSpamFolderList contains a list of folders on the system that contain
# spam that we are to learn from. One folder per line; folder names must not
# contain whitespace.
for CertainSpamFolder in $(cat "$workdir/CertainSpamFolderList")
do
    for CertainSpamIMAPSubFolder in cur new
    do
        echo "Gathering spam from $CertainSpamFolder/$CertainSpamIMAPSubFolder"
        mv "$CertainSpamFolder/$CertainSpamIMAPSubFolder"/* "$spamdir"
    done
done
# Move the mis-classified ham into the ham learning folder.
# CertainHamFolderList contains a list of folders on the system that contain
# ham that we are to learn from. One folder per line; folder names must not
# contain whitespace.
for CertainHamFolder in $(cat "$workdir/CertainHamFolderList")
do
    for CertainHamIMAPSubFolder in cur new
    do
        echo "Gathering ham from $CertainHamFolder/$CertainHamIMAPSubFolder"
        mv "$CertainHamFolder/$CertainHamIMAPSubFolder"/* "$hamdir"
    done
done
# It is somewhat quicker to add the --no-sync flag and then run
# sa-learn --sync at the end, but that causes sa-learn to create a
# folder called '= ' (equals space) in the working directory.
echo 'Learning from spam'
sa-learn --dbpath /var/mail/.spamassassin --spam "$spamdir"
echo 'Learning from ham'
sa-learn --dbpath /var/mail/.spamassassin --ham "$hamdir"
# Tidy up the spam and ham folders
rm -rf "$spamdir"
rm -rf "$hamdir"
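
Because the script trusts whatever is in the two list files, a typo in a folder name only shows up as an mv error at run time. A small pre-flight check along these lines might catch that earlier. It is not part of the original script, the function name check_folder_list is hypothetical, and like the script it assumes folder names contain no whitespace.

```shell
# check_folder_list FILE
# Reports every folder listed in FILE that lacks a cur or new
# subdirectory, and returns non-zero if any are missing.
check_folder_list() {
    status=0
    for folder in $(cat "$1")
    do
        for sub in cur new
        do
            if [ ! -d "$folder/$sub" ]
            then
                echo "Missing: $folder/$sub" >&2
                status=1
            fi
        done
    done
    return $status
}
```

You could run check_folder_list /root/train-sa/CertainSpamFolderList (and the same for the ham list) by hand after editing the files, before letting the nightly job loose on them.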

Run this script from a cron job each night to learn any spam or ham that you have received. You are free to modify and distribute this script. I take no responsibility if you break anything using it.
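
As a rough illustration of the nightly cron job, the snippet below prints a crontab line for the script. The 03:30 schedule is an arbitrary choice of mine, the helper name cron_entry is hypothetical, and the path assumes the script is saved as train-sa.sh in the working directory used above and is executable.

```shell
# cron_entry SCRIPT
# Prints a crontab line that runs SCRIPT at 03:30 every night.
cron_entry() {
    printf '%s %s\n' '30 3 * * *' "$1"
}

# Prints: 30 3 * * * /root/train-sa/train-sa.sh
cron_entry /root/train-sa/train-sa.sh
```

Paste the printed line in with crontab -e, as a user that can read the training folders and write to the Bayes database.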

 
