SpamBayes server-side

This page includes notes from users that have successfully managed to get SpamBayes working server-side. You should also see the related projects page for server-side projects based on SpamBayes.

postfix notes from Jonathan St-Andre

SpamBayes has been installed on one of our MX (running postfix) and is filtering all inbound emails for the whole company (~1000 employees, ~35000 incoming emails daily). The server is a dual PIII 933MHz with 512MB of memory and the load is pretty low (it's almost overkill).

According to the feedback I received from our users, it seems that it tags approximately 90% (for some it goes up to 95%) of the spam correctly. The rest of the spam is tagged as unsure. No false positives. The filter hasn't received too much spam training either, yet. Efficiency has been improving slowly, as we keep training it.

Here's a quick howto:

  1. Create the DB by training with a mailbox of ham and a mailbox of spam. I put the DB in /var/spambayes/hammie.db (as a DBM store).
  2. In master.cf, the smtp line has been changed for the following two lines:
    smtp      inet   n   -   n   -   -   smtpd
      -o content_filter=spambayes:
    
    and the following two lines were added at the end of the file:
    spambayes unix  -   n   n   -   -   pipe
      user=nobody argv=/usr/local/bin/hammiewrapper.sh
        $sender $recipient
    
  3. Here's what the hammiewrapper.sh file looks like:
    #!/bin/sh
    /usr/local/bin/sb_filter.py \
        -d /var/spambayes/hammie.db -f \
    | /usr/sbin/sendmail -f $*
    

qmail notes from Michael Martinez

SpamBayes is installed on our agency's smtp / MX gateway. This machine runs Redhat Linux 7.1, qmail 1.03, qmail-scanner 1.16, and hbedv's Antivir. Incoming mail is accepted by tcpserver and handed off to qmail-scanner. Qmail-scanner runs the virus software (antivir) and hands the message to qmail. Qmail accepts local delivery on all domain-bound email. This email is delivered to ~alias/.qmail-default. (This is a standard configuration for qmail).

~alias/.qmail-default pipes each email through Spambayes. The .qmail-default is set up as follows (note line wrapping):

| /usr/local/spambayes/hammiefilter.py \
    -d /usr/local/spambayes/.hammiedb \
| qmail-remote MSServer.csrees.usda.gov \
    "$SENDER" $DEFAULT@csrees.usda.gov

The permissions for the /usr/local/spambayes directory are set with the following command:

chown -R qmailq.qmail /usr/local/spambayes

As shown above, there are two pipes. The first pipes it through Spambayes. The second pipes it through qmail's remote delivery mechanism, which delivers the email to our Exchange Server.

Delivered emails are filtered on a per-user basis in Outlook by setting the Rules to detect the Spambayes tag in the message header. If the tag reads Spambayes-Classification: spam then the email is either deleted or placed in the user's Spam folder. If it reads Spambayes-Classification: unsure then it's placed in the user's Unsure folder. If it reads Spambayes-Classification: ham then nothing special is done - it is delivered to the user's Inbox as normal.

The user is given the choice of whether to set up his rules or not.

Training of Spambayes is done in the following manner: our users are given my email address and are told that, if they like, they may send emails to me that they consider spam, or that end up being mis-classified by the system. I created two directories:

/usr/local/spambayes/training/spamdir
/usr/local/spambayes/training/hamdir

The emails sent to me by the users are retrieved from the qmail archive and placed into the appropriate directory. When I'm ready to do a training (which I do once or twice a month), I run the following commands:

  1. I use a simple script to insert a blank From: line at the top of each email
  2. I use a simple script to remove the qmail-scanner header from the bottom of each email.
  3. uuencoded attachments are removed
  4. cat /usr/local/spambayes/training/spamdir/* \
        >> /usr/local/spambayes/training/spam
  5. cat /usr/local/spambayes/training/hamdir/* \
        >> /usr/local/spambayes/training/ham 
  6. /usr/local/spambayes/mboxtrain \
        -d /usr/local/spambayes/.hammiedb \
        -g /usr/local/spambayes/training/ham \
        -s /usr/local/spambayes/training/spam
    (This last step can be run without shutting down qmail.)

Most of the time, emails that are sent to me are clearly discernible as to whether they are spam or not. Occasionally there is an email that is borderline, or that one person considers spam but others don't. This is usually things like newsletter subscriptions or religious forums. In this case, I follow my own rule that if there is at least one person in the agency who needs or wants to receive this type of email, and as long as it is non-offensive, work-related, or there are a lot of people in the agency who have an interest in the topic, then I will either train it as ham, or, if it's already being tagged ham, leave it. An example of this are emails that discuss religious topics. There are a lot of people in this agency who are subscribed to religious discussion groups, so in my mind, it's good practice to make sure these messages are not tagged spam.

The above system works well on several levels. It's manageable because there's a central location for training and tagging spam (the smtp server). It's manageable also because our IT PC Support staff does not have to install SpamBayes on each PC nor train all of our users on its use. If a user does not like the way our system tags the emails, he does not have to set up his Outlook rules. But, we've had a good response from the users who are using their Rules. They're willing to put up with one or two mis-classified emails in order to have 95% of their junk email not in their Inbox.

Setting up Server-Side Spam filtering for IMAP

Dave Abrahams has put together notes explaining how he set up server-side filtering with SpamBayes and an IMAP server, using sb_imapfilter.py and sb_filter.py.

An Alternate method of Server Mail filtering in Linux or Unix environments

Aaron Konstam has given us this description of the setup used at Trinity University.

As opposed to other suggested server filtering setups with SpamBayes this approach has the advantage that although the server is doing all the filtering each, user on a client machine has complete control of the training of the filtering process to meet his or her own tastes. It is ideal for the university student lab environment but could be used in commercial environments as well.

The basis of this method is that all the user directories as well as the password authentication data are kept on the server. The authentication data is made available to all the client machines through a well known Unix and Linux service called NIS. Any user can sit at any machine and log in using the same password, change passwords and make any other changes to their user environment.

The home directories are NFS mounted from the server on all the client machines. Therefore, the users home directory on the client machine is identical to the one on the server. The user has access to his hammie.db file, his personal configuration file and all the SpamBayes software that has been installed on the clients. Of course the SpamBayes software is also installed on ther server.

Mail is filtered by the server using a .procmailrc file in the user's directory that runs sb_filter.py. One further thing, which should be obvious, is that we have created MX records so that all mail addressed to a client is actually delivered to the server.

Training can easily be done with a simple script such as:

#!/bin/bash
#script: trainsb

/usr/bin/sb_mboxtrain.py \
    -d $HOME/.hammie.db \
    -g  $HOME/Mail/$1 \
    -s $HOME/Mail/$2

used as follows:

trainsb ham spam

Notice that no proxy servers of any kind are necessary for the user to read their mail, train it, manipulate it or do anything else they want to do. However, if they want to use the web interface on the local client machine to train their mail that is also available to them.

As a side note we run our lab Windows machines in exactly the same way. There is a server for authenticating users and user's directories are kept on a central server. One imagines one could train users mail in exactly the same way on our Windows machines in our labs.