Geek Thoughts: January 2013

Sunday, 20 January 2013

Identifying spam

Refugees United

Yesterday, I participated in the Refugees United Modding Day 2013 organised by Rewired State for Refugees United. I worked on a hack that tried to implement and algorithm to identify spam and eventually also identify spammers. Our presentation was OK but not fantastic as time was short. The short of it was that by the end of the day I wasn't convinced the algorithm as coded worked so today I decided to get some statistics out of it and tweak it a little bit.

Bayesian Filters

The idea was to implement a Bayesian filtering algorithm to detect spam. This type of algorithms requires initial training using a body of existing messages comprising both spam and non-spam where you tell it which is which. By doing that, it will build a corpus of know good data and a corpus of known spam data. It will then use that stored information to score incoming messages. The resulting score is a probability between 0 and 1 of the message being spam (in practice the probability is between 0.01 and 0.99 because you can never be completely sure). After that, each time the algorithm gets its decision wrong, the user has the ability to correct it and it will learn from the correction. This is exactly the sort of algorithms that is implemented in popular email clients like Mozilla Thunderbird.

In order to train the filter, we had a file containing 5500 SMS messages, some spam, some ham. So in order to see how good the filter was, I decided to do the following:

Train the filter on the first few hundred messages in the file;
The score the remainder of the file and see for each score whether it correctly identified the message as ham or spam.

Tokenizing

The main factor that influences how such a filter works is the way the message is broken into individual tokens. The simplest algorithm is to break the message into words. Some words will be used mainly in spam messages, others mainly in innocent messages and yet others in both. Using 5% of the file for training and the remainder for scoring and applying that simple tokenizing algorithm, I got the following statistics:

Correct guesses: 72.03%
Neutral score (doesn't know if it's spam or ham): 25.68%
False positives (identified as spam when it is ham): 1.74%
False negatives (identified as ham when it is spam): 0.55%

That's not bad for a basic algorithm! So I decided to apply three additional tokenizing techniques on top of the core one and see what would happen:

The first technique consists in adding word pairs in addition to individual words. The logic behind this is that words like fantastic and offer may not be typical of spam on their own but the combination fantastic offer may be.
The second technique consists in adding the lowercase version of every word to the list of tokens in an attempt to see if weird capitalisation would catch spammers out.
The third one is a combination of the first two.

The results show that word pairs seem to be better than capitalisation but both together are even better.

Algorithm comparison: overall

However, this is not the whole story. The important metric is the bad stuff that ends up in the user's inbox as well as the good stuff that gets rejected. The former is composed of two parts: messages that the filter considered good when in fact they should have been caught as spam (false negatives) and messages that had a neutral score that are in fact spam (bad neutral). The latter is messages that the filter marked as spam when in fact they are good (false positives). Looking at those metrics, the result is not great.

Algorithm comparison: mistakes

False positives increase slightly with word pairs and quite significantly when introducing capitalisation variation. This may mean that a lot of legit users actually capitalise their messages like spammers. So not such a good algorithm after all.

Introducing Bias

One aspect of spam is that it is better to have false negatives than false positives. That is, it is better to have a bit of spam in the user's inbox but not block legit messages by mistake. So one typical way to do this is to introduce some bias into the algorithm to give a bit more weight to positive probabilities. Compared to the basic algorithm, it definitely cuts into false positives.

Algorithm bias comparison: overall

Algorithm bias comparison: mistakes

The Impact of Training

An interesting statistic is to see how much training the algorithm makes a difference. So I decided to compare the baseline where 5% of the file is used for training to a version where 10 of the file is used for training.

Algorithm training comparison: overall

Algorithm training comparison: mistakes

And of course, a well trained biased algorithm returns some decent numbers too when compared to the baseline.

Algorithm training and bias comparison: overall

Algorithm training and bias comparison: mistakes

Conclusion

The strength of statistical algorithms like Bayesian filtering is that it learns from the mistakes it does and it doesn't matter if spammers change their wording, the algorithm will adapt. This is demonstrated by the reasonably high rate of success reached by a very basic implementation of the technique with training on a few hundred messages.

In order to improve the efficiency of such an algorithm, there are a few options:

Include message meta-data such as sender and recipient details, in addition to the message text: this would also be a first step to identify spammers in addition to spam, which was one of the aims of the hack.
Use advanced tokenization techniques such as sparse binary polynomial hashing.
Use Markovian discrimination rather than Bayesian filtering.

All those improvements would come at the cost of increased complexity and processing power requirements so may not all be practical.

Sunday, 13 January 2013

for and while constructs in bash

The `for` construct

When you want to iterate over a list in bash, the first thing that comes to mind is to use a for loop, like this:

for f in "abc def"; do
    echo $f
done

Simple for loop

This works great when the list to iterate over is short and is composed of items that do not contain any white space. When they do, or the list is long, this construct will get into trouble. Let's demonstrate with a simple example. If I create a file with one item per line, 3 lines like this:

one
two
third line

Simple file called test

Then the first attempt at using a for loop would be:

for f in $(cat test); do
    echo $f
done

Simple for loop to read the file

The result is not quite what was expected:

one
two
third
line

Output of the for loop

You can put double quote in different places, this will not solve the problem. This is because the for construct splits items against white space and as far as it's concerned, an actual space character or a carriage return are the same and count as separators. Another limitation of the for construct is that the sub-command contained in $(...) needs to be fully executed before for can even start. If the output is large, it can run out of memory or just take a long time to get started.

The `while` construct

Fortunately, bash has another construct that can bypass those limitations, the while construct. It works slightly differently and needs the help of the read command.

cat test | while read f; do
    echo $f
done

A simple while example

And the result is:

one
two
third line

Output of the while loop

This works because the read command reads a full line and does not split on white space. Therefore the value that f is set to is a complete line in the file. The other advantage is that the pipe actually streams the output of the cat command to while and read, meaning that there is no need to wait until it's finished to handle its output. One typical use of that construct is when using the find command: with modern operating systems, file names can have spaces in them and even with a tight condition, find can return hundreds of lines of output.

Use the right tool for the job

So when should you use which construct?

If you are dealing with a list that can be large or where each item can contain space characters, use while;
If you are dealing with a short list where no item can contain a space character, you can use for.