Training SpamAssassin's accuracy with sa-learn

This guide explains how to use SpamAssassin's "sa-learn" tool to train its Bayesian token database and improve spam-flagging accuracy.

What is "sa-learn"?

Given a typical selection of your incoming mail classified as spam or ham (non-spam), this tool will feed each mail to SpamAssassin, allowing it to "learn" which signs are likely to mean spam and which are likely to mean ham. Simply run this command once for each of your mail folders, and it will learn from the mail therein. Note that csh-style globbing in the mail folder names is supported; in other words, listing a folder name as * will scan every folder that matches. See Mail::SpamAssassin::ArchiveIterator for more details.

SpamAssassin remembers which mail messages it has already learned, and will not re-learn those messages unless you use the --forget option. Messages learned as spam will have SpamAssassin markup removed on the fly. If you make a mistake and scan a mail as ham when it is spam, or vice versa, simply rerun this command with the correct classification, and the mistake will be corrected. SpamAssassin will automatically "forget" the previous classification. Users of spamd who wish to perform training remotely, over a network, should investigate the spamc -L switch.

Official sa-learn documentation can be found in the sa-learn manual page (man sa-learn).

Getting started

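This method only works for email accounts accessed over IMAP, because the training folders must be stored on the mail server where sa-learn can read them.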

Create the spam/ham folders

- In your preferred mail client, open the email account you are configuring this for.
- Inside your Inbox, create at least two new folders/directories: one for the untrained spam and one (or more) for your ham (non-spam). This article uses "Junk" as the untrained spam folder and "Seen" as the ham folder.

Being diligent and pro-active

- Make sorting part of your regular email routine. After you receive and read new email, move it to one of the two folders. If the email is good mail, move it to "Seen". If it is spam that SpamAssassin didn't already catch, move it to "Junk".
- This is the most difficult part of training properly, but it provides the best results. It takes some time, but the more tokens SpamAssassin is able to collect, the more accurate it becomes.
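
To keep an eye on how much mail is waiting to be trained, a small helper like the one below can count the messages in a folder. This is only a sketch: the function name count_maildir is hypothetical, and it assumes the folders are stored in Maildir format (with "cur" and "new" subdirectories), as in the paths used later in this guide.

```shell
#!/bin/sh
# count_maildir DIR
# Prints the number of message files in DIR/cur and DIR/new combined.
# A missing subdirectory is simply counted as empty.
count_maildir() {
    dir=$1
    total=0
    for sub in cur new; do
        [ -d "$dir/$sub" ] || continue
        n=$(find "$dir/$sub" -type f | wc -l)
        total=$((total + n))
    done
    echo "$total"
}

# Example usage (placeholder paths, replace with the real account details):
#   count_maildir /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Junk
#   count_maildir /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Seen
```

Comparing the two counts from time to time is a simple way to check that both spam and ham are being collected, not just one of them.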

Running the sa-learn command

sa-learn commands must be run as the user that owns the mail account in order to function properly. To run these commands while logged in as the 'root' user, use the following syntax to run them as the account owner:

sudo -H -u <CPANELUSER> bash -c '/usr/local/bin/sa-learn <COMMANDS>'

From the command line, run sa-learn on your email account's "Junk" folder.
- sa-learn -p /home/USER/.spamassassin/user_prefs --spam /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Junk/{cur,new}
- Depending on how many messages you have (and whether you have run it before), you'll see output similar to this: Learned tokens from 214 message(s) (1009 message(s) examined)
From the command line, run sa-learn on your email account's "Seen" folder. Note the two differences here: "--ham" and "Seen".
- sa-learn -p /home/USER/.spamassassin/user_prefs --ham /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Seen/{cur,new}
- You'll see similar output for this, depending on how many messages are in the folder.
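
The two training commands above can be combined into one small script. The sketch below is an illustration rather than an official tool: the helper names run and train and the DRY_RUN variable are hypothetical, and the paths are the same placeholders used in this guide. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Placeholders from this guide; replace with the real account details.
PREFS=/home/USER/.spamassassin/user_prefs
MAILDIR=/home/USER/mail/DOMAIN.TLD/ACCOUNT

# run CMD [ARGS...]
# Prints the command when DRY_RUN is 1 (the default in this sketch),
# otherwise executes it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# train CLASS FOLDER
# CLASS is spam or ham; FOLDER is the folder name without the leading dot.
# Both Maildir subdirectories are passed explicitly, which is what the
# {cur,new} brace expansion produces in bash.
train() {
    run sa-learn -p "$PREFS" "--$1" "$MAILDIR/.$2/cur" "$MAILDIR/.$2/new"
}

train spam Junk
train ham Seen
```

Once the printed commands look right, setting DRY_RUN=0 in the environment makes the script execute them instead. As noted above, it should be run as the mail account owner.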

Checking learned tokens

Run: sa-learn --dump magic

# sa-learn --dump magic
    0.000 0 3 0 non-token data: bayes db version
    0.000 0 1242 0 non-token data: nspam
    0.000 0 3872 0 non-token data: nham
    0.000 0 155784 0 non-token data: ntokens
    0.000 0 1404770116 0 non-token data: oldest atime
    0.000 0 1411483933 0 non-token data: newest atime
    0.000 0 0 0 non-token data: last journal sync atime
    0.000 0 1410340539 0 non-token data: last expiry atime
    0.000 0 5529600 0 non-token data: last expire atime delta
    0.000 0 137169 0 non-token data: last expire reduction count

nspam - Number of spam messages learned.
nham - Number of ham (non-spam) messages learned.
ntokens - Number of tokens stored in the database.
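
To pull a single counter out of that output, for example in a script, a short awk filter is enough. The function name magic_value below is hypothetical; the sketch relies only on the layout shown above, where the count is the third field and the label is the last field on each line. It is meant for the single-word labels (nspam, nham, ntokens).

```shell
#!/bin/sh
# magic_value NAME
# Reads "sa-learn --dump magic" output on stdin and prints the count for
# NAME, where NAME is a single-word label such as nspam, nham or ntokens.
magic_value() {
    awk -v name="$1" '$NF == name { print $3 }'
}

# Demonstration using two lines copied from the sample output above.
sample='    0.000 0 1242 0 non-token data: nspam
    0.000 0 3872 0 non-token data: nham'
printf '%s\n' "$sample" | magic_value nspam   # prints 1242
printf '%s\n' "$sample" | magic_value nham    # prints 3872

# Against the live database (run as the mail account owner):
#   sa-learn --dump magic | magic_value ntokens
```

These counters are also a quick way to see whether training has gone far enough to matter: by default, SpamAssassin does not use the Bayes classifier until it has learned at least 200 spam and 200 ham messages (the bayes_min_spam_num and bayes_min_ham_num settings).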

You can run these commands manually whenever you like, especially if you prefer to stay in control, but doing so can become a chore. Many people prefer to use cron jobs for this. I'd recommend running the cron jobs only once per day, during off-peak hours.
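
As one possible way to set up such a cron job, the sketch below is a script that trains both folders, with an example crontab entry in its header comment. The script location (/home/USER/bin/sa-train.sh) and the 03:30 schedule are assumptions chosen for illustration; the other paths are the placeholders used throughout this guide. It should be installed in the mail account owner's crontab, not root's.

```shell
#!/bin/sh
# Hypothetical script, for example saved as /home/USER/bin/sa-train.sh and
# run once a day at 03:30 from the account owner's crontab with an entry
# such as:
#
#   30 3 * * * /home/USER/bin/sa-train.sh
#
MAILDIR=/home/USER/mail/DOMAIN.TLD/ACCOUNT
PREFS=/home/USER/.spamassassin/user_prefs

# Cron often runs with a minimal PATH, so fall back to the full path
# used earlier in this guide if sa-learn is not found.
SA_LEARN=$(command -v sa-learn || echo /usr/local/bin/sa-learn)

# train CLASS FOLDER
# cur and new are listed explicitly because cron usually runs /bin/sh,
# which may not expand {cur,new} the way bash does.
train() {
    "$SA_LEARN" -p "$PREFS" "--$1" "$MAILDIR/.$2/cur" "$MAILDIR/.$2/new"
}

# Do nothing if sa-learn or the mail directory is missing.
if [ -x "$SA_LEARN" ] && [ -d "$MAILDIR" ]; then
    train spam Junk
    train ham Seen
fi
```

Because sa-learn skips messages it has already learned, running this daily over the same folders should only process the mail added since the previous run.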

Also, remember that training works both ways: if non-spam ends up in the spam folder, move it into "Seen" so SpamAssassin can learn from the false positive.