# Occam’s razor-based spam filter

- Tiago A. Almeida
^{1}Email author and - Akebo Yamakami
^{2}

**3**:67

**DOI: **10.1007/s13174-012-0067-x

© The Brazilian Computer Society 2012

**Received: **22 July 2011

**Accepted: **10 August 2012

**Published: **2 October 2012

## Abstract

Nowadays e-mail spam is not a novelty, but it is still an important rising problem with a big economic impact in society. Spammers manage to circumvent current spam filters and harm the communication system by consuming several resources, damaging the reliability of e-mail as a communication instrument and tricking recipients to react to spam messages. Consequently, spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle. Furthermore, we have conducted an empirical experiment on six public and real non-encoded datasets. The results indicate that the proposed filter is fast to construct, incrementally updateable and clearly outperforms the state-of-the-art spam filters.

### Keywords

Minimum description length Spam filter Text categorization Knowledge-based system Machine learning## 1 Introduction

E-mail is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate. E-mail is not only used to support conversation but also as a task manager, document delivery system and archive. The downside of this success is the constantly growing volume of e-mail spam we receive. The problem of spams can be quantified in economical terms since many hours are wasted everyday by workers. It is not just the time they waste reading the spam but also the time they spend deleting those messages [33].

According to annual reports, the amount of spam is frightfully increasing. In absolute numbers, the average of spams sent per day increased from 2.4 billion in 2002^{1} to 300 billion in 2010.^{2} The same report indicates that more than 90 % of incoming e-mail traffic is spam. According to the US Technology Readiness Survey,^{3} the cost of spam in terms of lost productivity in US has reached US$ 21.58 billion annually, while the worldwide productivity cost of spam is estimated to be US$ 50 billion. On a worldwide basis, the information technology cost of dealing with spam was estimated to rise from US$ 20.5 billion in 2003 to US$ 198 billion in 2009.

According to a report published by McAfee,^{4} the cost in lost productivity per day per user is approximately equal to US$ 0.50, based on the user’s having to spend 30 s for dealing with only two spam messages each day and the user’s spam filter working at 95 percent accuracy (value higher than the average achieved by the majority of available anti-spam filters). Therefore, the productivity loss per employee per year due to spam is approximately equal to US$ 182.50. Supposing a company with 1,000 workers earning US$ 30 per hour, it would suffer about US$ 182,500 per year in lost productivity. This works out to more than US$ 41,000 per 1 percent of spam allowed into a company. Fortunately, many solutions are being proposed to avoid this “plague” and one of more promising is the use of machine learning techniques for automatically filtering e-mail messages.

Many methods have been proposed to automatic spam filtering, such as rule-based approaches, white and blacklists, collaborative spam filtering, challenge and response systems, methods which take into account the sender’s domain, clustering [24], among many others. However, among all proposed techniques, machine learning algorithms have been achieved more success [22]. These methods include approaches that are considered top-performers in text categorization, like rule induction algorithm [20, 21], Rocchio [10, 31, 43], Boosting [19, 40], decision tree [48], memory-based learning [12], Naïve Bayes (NB) classifiers [3, 5, 7, 11, 42, 46, 50] and support vector machines (SVM) [1, 25, 26, 30, 34, 39]. The two latter currently appear to be the best anti-spam filters presented in the literature [8, 22, 23, 49].

It has been observed that compression-based techniques seem to provide a promising alternative approach to categorization tasks, as stated by Frank et al. [27]. However, the authors showed that for text categorization, such methods do not compete with the state-of-the-art techniques. They clearly state that they do not believe their results are specific to the analyzed compression models, because if the occurrence of a single word determines whether a message belongs to a category or not, any compression scheme would likely fail to classify the message correctly. According to the authors, machine learning schemes fare better because they automatically eliminate irrelevant features.

Similar to the work of Frank et al. [27], Teahan and Harper [47] performed extensive experiments to evaluate the performance of different approaches for text categorization on the standard Reuters-21578 collection. They compared compression-based algorithms, such as prediction by partial matching (PPM) with Naïve Bayes classifiers and SVM. Based on the obtained results, the authors conclude that compression-based models perform better than word-based Naïve Bayes techniques and approach the performance of linear SVM.

Fortunately, Bratko et al. [18] investigated the performance achieved by data compression models in spam filtering task. They evaluate the filtering performance of two different compression algorithms: dynamic Markov compression (DMC) and prediction by partial matching (PPM). The results of their evaluation indicate that compression models are surprisingly good and their performances are comparable with Bogofilter, the best spam filter at that time. However, the most current published results [16, 22, 29] including TREC 2007 and CEAS 2008 Live Competition anti-spam challenges, indicate that the spam filters based on Naïve Bayes and SVM classifiers outperform such compression-based models.

A relatively recent method for inductive inference which is still rarely employed in text categorization tasks is the minimum description length (MDL) principle. It states that the best explanation, given a limited set of observed data, is the one that yields the greatest compression of the data [15, 28, 41].

In this paper, we present a spam filtering approach based on the MDL principle and compare its performance with seven different models of Naïve Bayes classifiers and the SVMs. Here, we carry out an evaluation with the practical purpose of filtering e-mail spams in order to compare the currently top-performer’s spam filters. We have conducted an empirical experiment using six well-known, large, and public databases and the reported results indicate that our approach outperforms currently established spam filters.

A preliminary version of this work was presented at ACM SAC 2010 [4]. Here, we significantly extend the performance evaluation. First, we offer much more details about the new method, its main features and how it works. Second, and the most important, we compare the proposed approach with several established classifiers instead of only two, as presented in the mentioned paper.

The remainder of this paper is organized as follows: Sect. 2 presents the basic concepts regarding the main spam filtering techniques. In Sect. 3, we describe a new approach based on the MDL principle. Experimental results are showed in Sect. 4. Finally, Sect. 5 offers conclusions and directions for future work.

## 2 Basic concepts

In the setting of spam filtering, there are only two category labels: \(\mathtt {spam}\) and \(\mathtt {legitimate}\) (also called \(\mathtt {ham}\)). Each message \(m \in \mathcal M \) can only be assigned to one of them, but not to both.

Assuming that each message \(m\) is composed by a set of tokens \(m = \{t_1, \ldots , t_{|m|}\}\), where each token \(t_k\) corresponds to a word (“adult”, for example), a set of words (“to be removed”), or a single character (“$”), we can represent each e-mail by a vector \(\varvec{x} = \langle x_1, \ldots , x_{|m|} \rangle \), where \(x_1, \ldots , x_{|m|}\) are values of the attributes \(X_1, \ldots , X_{|m|}\) associated with the tokens \(t_1, \ldots , t_{|m|}\). In the simplest case, each term represents a single token and all attributes are Boolean: \(X_i = 1\) if the message contains \(t_i\) or \(X_i = 0\), otherwise.

Alternatively, attributes may be an integer computed by token frequencies (TF) representing how many times each token occurs in the message. A third alternative is to associate each attribute \(X_i\) to a normalized TF, \(x_i = \frac{n(t_i)}{|m|}\), where \(n(t_i)\) is the number of occurrences of the token represented by \(X_i\) in \(m\), and \(|m|\) is the length of \(m\) measured in token occurrences. Normalized TF takes into account the term repetition versus the size of message [13].

## 3 Spam filtering based on minimum description length principle

The MDL principle is a formalization of Occam’s razor in which the best hypothesis for a given set of data is the one that yields compact representations. The traditional MDL principle states that the preferred model results in the shortest description of the model and the data, given this model. In other words, the model that best compresses the data is selected. This model selection criterion naturally balances the complexity of the model and the degree to which this model fits the data. This principle was first introduced by Rissanen [41] and it becomes an important concept in information theory.

Let \(\mathcal Z \) be a finite or countable set and let \(P\) be a probability distribution on \(\mathcal Z \). Then there exists a prefix code \(C\) for \(\mathcal Z \) such that for all \(z \in \mathcal Z \), \(L_{C}(z) = \lceil -\mathrm{log}_{2} P(z)\rceil \). \(C\) is called the code corresponding to \(P\). Similarly, let \(C\) be a prefix code for \(\mathcal Z \). Then there exists a (possibly defective) probability distribution \(P\) such that for all \(z \in \mathcal Z \), \(-\mathrm{log}_{2} P^{\prime }(z) = L_{C^{\prime }} (z)\). \(P^{\prime }\) is called the probability distribution corresponding to \(C^{\prime }\). Thus, large probability according to \(P\) means small code length according to the code corresponding to \(P\) and vice versa [15, 28, 41].

The goal of statistical inference may be cast as trying to find regularity in the data. Regularity may be identified with ability to compress. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses \(\mathcal H \) and data set \(\mathcal D \), we should try to find the hypothesis or combination of hypotheses in \(\mathcal H \) that compresses \(\mathcal D \) most [15, 28, 41].

This idea can be applied to all sorts of inductive inference problems, but it turns out to be most fruitful in problems of model selection and, more generally, dealing with overfitting [28]. An important property of MDL methods is that they provide automatic and inherent protection against overfitting and can be used to estimate both the parameters and the structure of a model. In contrast, to avoid overfitting when estimating the structure of a model, traditional methods such as maximum likelihood must be modified and extended with additional, typically ad hoc principles [28].

In essence, compression algorithms can be applied to text categorization by building one compression model from the training documents of each class and using these models to evaluate the target document.

### 3.1 The MDL anti-spam filter

Given a set of classified training messages \(\mathcal M \), the task is to assign a target e-mail \(m\) with an unknown label to one of the classes \(c \in \{spam, ham\}\). So, the method measures the increase of the description length of the data set as a result of the addition of the target document. Finally, it chooses the class for which the description length increase is minimal.

- 1.
Tokenization: the classifier extracts all terms of the new message \(m = \{t_1, \ldots , t_{|m|}\}\);

- 2.Compute the increase of the description length when \(m\) is assigned to each class \(c \in \{spam, ham\}\):$$\begin{aligned}&L_{m} (spam) = \sum ^{|m|}_{i=1} \left\lceil -\mathrm{log}_{2} \left(\frac{n_{spam} (t_{i}) + \frac{1}{|\Omega |}}{n_{spam} + 1}\right) \right\rceil \\&L_{m} (ham) = \sum ^{|m|}_{i=1} \left\lceil -\mathrm{log}_{2} \left(\frac{n_{ham} (t_{i}) + \frac{1}{|\Omega |}}{n_{ham} + 1}\right) \right\rceil \end{aligned}$$
- 3.if \(L_{m} (spam) < L_{m} (ham)\), thenand \(m\) is classified as spam; otherwise,$$\begin{aligned} mdlOutput = 1 - \left(\frac{L_{m} (spam)}{L_{m} (ham)}\right) \end{aligned}$$and \(m\) is labeled as ham.$$\begin{aligned} mdlOutput = -1 \times \left[1 - \left(\frac{L_{m} (ham)}{L_{m} (spam)}\right)\right] \end{aligned}$$
- 4.
Training method.

### 3.2 Preprocessing and tokenization

We did not perform language-specific preprocessing techniques such as word stemming, stop word removal, or case folding, since other researchers found that such techniques tend to hurt spam-filtering accuracy [22, 38, 51]. However, we use an e-mail-specific preprocessing before the classification task. In this way, we employ the Jaakko Hyvattis normalizemime.^{5} This program converts the character set to UTF-8, decoding Base64, Quoted-Printable and URL encoding and adding warn tokens in case of encoding errors. It also appends a copy of HTML/XML message bodies with most tags removed, decodes HTML entities and limits the size of attached binary files.

Tokenization is the first stage in the classification pipeline; it involves breaking the text stream into tokens (“terms”), usually by means of a regular expression. We consider in this work that terms start with a printable character, followed by any number of alphanumeric characters, excluding dots, commas and colons from the middle of the pattern. With this pattern, domain names and mail addresses will be split at dots, so the classifier can recognize a domain even if subdomains vary [45]. The actual tokenization schema is defined by the following regular expression:

These pattern use Unicode categories: means everything except whitespace and control chars (POSIX [:graph:]); \(\backslash \)p{L}\(\backslash \)p{M}\(\backslash \)p{N} collectively match all alphanumerical characters ([:alnum:] in POSIX).

As proposed by Drucker et al. [25] and Metsis et al. [38], we do not consider the number of times a token appears in each message. In this way, each token is computed only once per message it appears.

### 3.3 Training method

- 1.For each term \(t_i\) of \(m\) do:
- (a)
Search for \(t_i\) in the training database;

- (b)
If \(t_i\) is found then update the number of messages on the class of \(m\) that \(t_i\) has appeared, otherwise insert \(t_i\) in the database.

- (a)

Anti-spam classifiers generally build their predicting models by learning from examples. A basic training method is to start with an empty model, classify each new sample and train it in the right class if the classification is wrong. This is known as train on error (TOE). An improvement to this method is to train also when the classification is right, but the score is near the boundary—that is, train on near error (TONE). This method is also called thick threshold training [45].

The advantage of TONE over TOE is that it accelerates the learning process by exposing the filter to additional hard-to-classify samples in the same training period. Therefore, we employ the TONE as training method used by the proposed MDL anti-spam filter.

In our evaluations, we empirically set the uncertainty interval (TONE threshold) \(\Sigma = [-0.1, 0.1]\). It means that if \(mdlOutput \in \Sigma \), the training method is requested.

## 4 Experimental results

^{6}Enron corpora tries to keep the same characteristics of a real user mailbox. It is composed by legitimate messages extracted from the mailboxes of six former employees of the Enron Corporation. The composition of each dataset is shown in Table 1. For more details and statistics, refer to [38].

Amount of messages in each Enron dataset

Dataset | No. of legitimate | No. of spam | Total |
---|---|---|---|

Enron 1 | 3,672 | 1,500 | 5,172 |

Enron 2 | 4,361 | 1,496 | 5,857 |

Enron 3 | 4,012 | 1,500 | 5,512 |

Enron 4 | 1,500 | 4,500 | 6,000 |

Enron 5 | 1,500 | 3,675 | 5,175 |

Enron 6 | 1,500 | 4,500 | 6,000 |

Total | 16,545 | 17,171 | 33,716 |

Enron 1: results achieved by each filter

Classifiers | Sre (%) | Spr (%) | Lre (%) | Lpr (%) | \(\text{ Acc}_\mathrm{w}\) (%) | TCR | MCC | \(T\) (s) |
---|---|---|---|---|---|---|---|---|

Basic NB | 91.33 | 85.09 | 93.48 | 96.36 | 92.86 | 4.054 | 0.831 | 84 |

MN TF NB | 82.00 | 73.21 | 87.77 | 92.29 | 86.10 | 2.083 | 0.676 | 94 |

MN Bool NB | 82.67 | 60.19 | 77.72 | 91.67 | 79.15 | 1.389 | 0.560 | 84 |

MV Bern NB | 72.67 | 60.56 | 80.71 | 87.87 | 78.38 | 1.339 | 0.508 | 84 |

Bool NB | 96.00 | 52.55 | 64.67 | 97.54 | 73.75 | 1.103 | 0.551 | 84 |

Gauss NB | 85.33 | 89.51 | 95.92 | 94.13 | 92.86 | 4.054 | 0.824 | 126 |

Flex Bayes | 86.67 | 88.44 | 95.38 | 94.61 | 92.86 | 4.054 | 0.825 | 132 |

SVM | 83.33 | 87.41 | 95.11 | 93.33 | 91.70 | 3.488 | 0.796 | 248 |

MDL | 92.00 | 92.62 | 97.01 | 96.75 |
| 6.552 |
| 83 |

Enron 2: results achieved by each filter

Classifiers | Sre (%) | Spr (%) | Lre (%) | Lpr (%) | \(\text{ Acc}_\mathrm{w}\) (%) | TCR | MCC | \(T\) (s) |
---|---|---|---|---|---|---|---|---|

Basic NB | 80.00 | 97.56 | 99.31 | 93.53 | 94.38 | 4.545 | 0.850 | 150 |

MN TF NB | 75.33 | 96.58 | 99.08 | 92.13 | 93.02 | 3.659 | 0.812 | 166 |

MN Bool NB | 76.00 | 98.28 | 99.54 | 92.36 | 93.53 | 3.947 | 0.827 | 150 |

MV Bern NB | 66.00 | 81.82 | 94.97 | 89.06 | 87.56 | 2.055 | 0.657 | 152 |

Bool NB | 95.33 | 81.25 | 92.45 | 98.30 | 93.19 | 3.750 | 0.836 | 150 |

Gauss NB | 75.33 | 98.26 | 99.54 | 92.16 | 93.36 | 3.846 | 0.823 | 208 |

Flex Bayes | 64.00 | 97.96 | 99.54 | 88.96 | 90.46 | 2.679 | 0.743 | 220 |

SVM | 90.67 | 90.67 | 96.80 | 96.80 | 95.23 | 5.357 | 0.875 | 401 |

MDL | 91.33 | 99.28 | 99.77 | 97.10 |
| 10.714 |
| 148 |

Enron 3: results achieved by each filter

Classifiers | Sre (%) | Spr (%) | Lre (%) | Lpr (%) | \(\text{ Acc}_\mathrm{w}\) (%) | TCR | MCC | \(T\) (s) |
---|---|---|---|---|---|---|---|---|

Basic NB | 58.00 | 100.00 | 100.00 | 86.45 | 88.59 | 2.381 | 0.708 | 52 |

MN TF NB | 62.00 | 100.00 | 100.00 | 87.58 | 89.67 | 2.632 | 0.737 | 60 |

MN Bool NB | 60.00 | 100.00 | 100.00 | 87.01 | 89.13 | 2.500 | 0.723 | 52 |

MV Bern NB | 100.00 | 85.23 | 93.53 | 100.00 | 95.29 | 5.769 | 0.893 | 52 |

Bool NB | 95.33 | 87.73 | 95.02 | 98.20 | 95.11 | 5.556 | 0.881 | 52 |

Gauss NB | 55.33 | 97.65 | 99.50 | 85.65 | 87.50 | 2.174 | 0.676 | 82 |

Flex Bayes | 52.67 | 97.53 | 99.50 | 86.78 | 72.83 | 2.055 | 0.656 | 90 |

SVM | 91.33 | 96.48 | 98.76 | 96.83 | 96.74 | 8.333 | 0.917 | 159 |

MDL | 90.00 | 100.00 | 100.00 | 96.40 |
| 10.000 |
| 52 |

Enron 4: results achieved by each filter

Classifiers | Sre (%) | Spr (%) | Lre (%) | Lpr (%) | \(\text{ Acc}_\mathrm{w}\) (%) | TCR | MCC | \(T\) (s) |
---|---|---|---|---|---|---|---|---|

Basic NB | 95.33 | 100.00 | 100.00 | 87.72 | 96.50 | 21.429 | 0.914 | 52 |

MN TF NB | 94.00 | 100.00 | 100.00 | 84.75 | 95.50 | 16.667 | 0.893 | 60 |

MN Bool NB | 97.11 | 100.00 | 100.00 | 92.02 | 97.83 | 34.615 | 0.945 | 52 |

MV Bern NB | 98.22 | 100.00 | 100.00 | 94.94 | 98.67 | 56.250 | 0.966 | 52 |

Bool NB | 98.00 | 100.00 | 100.00 | 94.34 | 98.50 | 50.000 | 0.962 | 52 |

Gauss NB | 94.22 | 100.00 | 100.00 | 85.23 | 95.67 | 17.308 | 0.896 | 84 |

Flex Bayes | 95.78 | 100.00 | 100.00 | 88.76 | 96.83 | 23.684 | 0.922 | 91 |

SVM | 98.89 | 100.00 | 100.00 | 96.77 |
| 90.000 |
| 161 |

MDL | 97.11 | 100.00 | 100.00 | 92.02 | 97.83 | 34.615 | 0.945 | 51 |

Enron 5: results achieved by each filter

Classifiers | Sre (%) | Spr (%) | Lre (%) | Lpr (%) | \(\text{ Acc}_\mathrm{w}\) (%) | TCR | MCC | \(T\) (s) |
---|---|---|---|---|---|---|---|---|

Basic NB | 90.22 | 98.81 | 97.33 | 80.22 | 92.28 | 9.200 | 0.832 | 48 |

MN TF NB | 88.59 | 100.00 | 100.00 | 78.12 | 91.89 | 8.762 | 0.832 | 54 |

MN Bool NB | 95.11 | 100.00 | 100.00 | 89.29 | 96.53 | 20.444 | 0.922 | 48 |

MV Bern NB | 98.10 | 91.86 | 78.67 | 94.40 | 92.47 | 9.436 | 0.814 | 48 |

Bool NB | 85.87 | 100.00 | 100.00 | 74.26 | 89.96 | 7.077 | 0.799 | 48 |

Gauss NB | 88.59 | 99.39 | 98.67 | 77.89 | 91.51 | 8.364 | 0.821 | 86 |

Flex Bayes | 91.58 | 98.54 | 96.67 | 82.39 | 93.05 | 10.222 | 0.845 | 92 |

SVM | 89.40 | 99.70 | 99.33 | 79.26 | 92.28 | 9.200 | 0.837 | 166 |

MDL | 99.73 | 98.39 | 96.00 | 99.31 |
| 52.571 |
| 48 |

Enron 6: results achieved by each filter

Classifiers | Sre (%) | Spr (%) | Lre (%) | Lpr (%) | \(\text{ Acc}_\mathrm{w}\) (%) | TCR | MCC | \(T\) (s) |
---|---|---|---|---|---|---|---|---|

Basic NB | 87.78 | 99.00 | 97.33 | 72.64 | 90.17 | 7.627 | 0.781 | 102 |

MN TF NB | 75.78 | 99.42 | 98.67 | 57.59 | 81.50 | 4.054 | 0.651 | 118 |

MN Bool NB | 93.33 | 97.45 | 92.67 | 82.25 | 93.17 | 10.976 | 0.828 | 102 |

MV Bern NB | 96.00 | 92.31 | 76.00 | 86.36 | 91.00 | 8.333 | 0.753 | 106 |

Bool NB | 66.67 | 99.67 | 99.33 | 49.83 | 74.83 | 2.980 | 0.572 | 102 |

Gauss NB | 89.56 | 98.05 | 94.67 | 75.13 | 90.83 | 8.182 | 0.785 | 162 |

Flex Bayes | 94.22 | 97.03 | 91.33 | 84.05 | 93.50 | 11.538 | 0.833 | 180 |

SVM | 89.78 | 95.28 | 86.67 | 73.86 | 89.00 | 6.818 | 0.727 | 316 |

MDL | 98.67 | 95.48 | 86.00 | 95.56 |
| 16.667 |
| 99 |

The results achieved by the proposed MDL anti-spam filter are compared with the ones attained by methods considered the actual top performers in anti-spam filtering: seven different models of NB classifiers (Basic NB [42], multinomial term frequency NB [MN TF NB] [37], multinomial Boolean NB [MN Bool NB] [44], multivariate Bernoulli NB [MV Bern NB] [35], Boolean NB [Bool NB] [38], multivariate Gauss NB [Gauss NB] [38], and flexible Bayes [Flex Bayes] [32]) and linear SVM with Boolean attributes [25, 26, 30, 34].

It is important to point out that all the Naïve Bayes classifiers and the proposed MDL spam filter were implemented in Matlab. For implementation details and parameters of each Naïve Bayes approach, refer [1]. On the other hand, the evaluated SVM technique is provided by the well-known LibSVM toolbox.^{7} We have used the linear kernel with the default parameters, as suggested by Kolcz and Alspector [34]. Regarding the training method, it is important to note that only the MDL classifier employs the train on near error strategy with uncertainty interval \(\Sigma = [-0.1\; 0.1]\).

A comprehensive set of results, including all tables and figures, is available at http://www.dt.fee.unicamp.br/~tiago/research/spam/.

Regarding the results achieved by the classifiers, the MDL spam filter outperformed the other methods for the majority e-mail collections used in our empirical evaluation. It is important to realize that in some situations, the MDL performs much better than SVM and NB classifiers. For instance, for Enron 1 (Table 2), MDL achieved spam recall rate equal to 92 % while SVM attained 83.33 %, even though MDL presented better legitimate recall. It means that for Enron 1 MDL was able to recognize over 8 % more of spam than SVM, representing an improvement of 10.40 %. In a real situation, this difference would be extremely important. Note that, the same result can be found for Enron 2 (Table 3), Enron 5 (Table 6), and Enron 6 (Table 7). Both methods, MDL and SVM, achieved similar performance with no significant statistical difference just for Enron 3 (Table 4) and Enron 4 (Table 5).

The results indicate that the data compression model is more efficient to distinguish messages as spams or hams than other compared spam filters. It achieved an impressive average accuracy rate higher than 97 % and high precision \(\times \) recall rates for all datasets indicating that the MDL classifier makes few mistakes. We also verify that the MDL classifier achieved an average MCC score higher than 0.925 for all tested e-mail collections. It clearly indicates that the proposed filter almost accomplished a perfect prediction (MCC = 1.000) and it is much better than not using a filter (MCC = 0.000).

Among the evaluated NB classifiers, the results indicate that all of them achieved similar performance with no significant statistical difference. However, they achieved lower results than MDL and SVM, which attained accuracy rate higher than 90 % for the most of Enron datasets.

Nevertheless, it is important to note that TCR is really not an informative measurement, as previously argued by Metsis et al. [38] and Cormack [22]. For instance, for Enron 4 (Table 5), MDL and SVM achieved similar performances (\(\text{ MCC}_\mathrm{MDL} = 0.945\) and \(\text{ MCC}_\mathrm{SVM} = 0.978\)). However, their TCR are very different (\(\text{ TCR}_\mathrm{MDL} = 34.615\) and \(\text{ TCR}_\mathrm{SVM} = 90.000\)), besides their precision \(\times \) recall rates are very close.

Regarding time performance, in all tests, the MDL spam filter spent less or equal time than other compared techniques. Note that, the time consumed by the methods to process Enron 2 (Table 3) and 6 (Table 7) is higher than other Enron databases. It is because such datasets are composed by a lot of large messages, with big content.

## 5 Conclusions and further work

In this paper, we have presented a new spam filtering approach based on the MDL principle that has proved to be very fast to construct and incrementally updateable. We have also compared its performance with the well-known linear SVM and seven different models of Naïve Bayes classifiers, something the spam literature does not always acknowledge.

To evaluate the proposed approach we have conducted empirical experiments using well-known, real, large and public databases and the reported results indicate that the proposed classifier outperforms currently established spam filters. It is important to emphasize that MDL spam filter has obtained the best average performance for all analyzed collections presenting an average accuracy rate higher than 97 % for all e-mail datasets.

Actually, we are conducting more experiments to compare our approach with other compression-based methods, along with commercial and open-source anti-spam filters, such as DMC, PPM, Bogofilter, SpamAssassin and ProcMail, among others.

Future works include evaluating the MDL spam filter to classify messages in environments where the text have rigid restriction in length, such as SMS spam [6], blog spam and social network spam, among others.

The LibSVM toolbox for Matlab is free available to download at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

## Declarations

### Acknowledgments

This work is supported by the Brazilian funding agencies CNPq, CAPES and FAPESP.

## Authors’ Affiliations

## References

- Almeida T, Yamakami A (2010) Content-based spam filtering. In: Proceedings of the 23rd IEEE International Joint Conference on Neural Networks. Barcelona, Spain, pp 1–7
- Almeida T, Yamakami A (2011) Redução de Dimensionalidade Aplicada na Classificação de Spams Usando Filtros Bayesianos. Revista Brasileira de Computação Aplicada 3(1):16–29View ArticleGoogle Scholar
- Almeida T, Yamakami A, Almeida J (2009) Evaluation of approaches for dimensionality reduction applied with Naive Bayes anti-spam filters. In: Proceedings of the 8th IEEE International Conference on Machine Learning and Applications. Miami, FL, USA, pp 517–522
- Almeida T, Yamakami A, Almeida J (2010a) Filtering spams using the minimum description length principle. In: Proceedings of the 25th ACM Symposium on Applied Computing. Sierre, Switzerland, pp 1856–1860
- Almeida T, Yamakami A, Almeida J (2010b) Probabilistic anti-spam filtering with dimensionality reduction. In: Proceedings of the 25th ACM Symposium On Applied Computing. Sierre, Switzerland, pp 1804–1808
- Almeida T, Hidalgo JG, Yamakami A (2011a) Contributions to the study of SMS spam filtering: new collection and results. In: Proceedings of the 2011 ACM Symposium on Document Engineering. Mountain View, CA, USA, pp 259–262
- Almeida T, Almeida J, Yamakami A (2011b) Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers. J Internet Serv Appl 1(3):183–200View ArticleGoogle Scholar
- Almeida TA, Yamakami A (2012a) Advances in spam filtering techniques. In: Elizondo D, Solanas A, Martinez-Balleste A (eds) Com putational intelligence for privacy and security. Studies in computational intelligence. vol 394. Springer, Berlin, pp 199–214
- Almeida TA, Yamakami A (2012b) Facing the spammers: a very effective approach to avoid junk e-mails. Expert Syst Appl: 1–5
- Anagnostopoulos A, Broder A, Punera K (2008) Effective and efficient classification on a search-engine model. Knowl Inf Syst 16(2):129–154View ArticleGoogle Scholar
- Androutsopoulos I, Koutsias J, Chandrinos K, Paliouras G, Spyropoulos C (2000a) An evalutation of Naive Bayesian anti-spam filtering. In: Proceedings of the 11th European Conference on Machine Learning. Barcelona, Spain, pp 9–17
- Androutsopoulos I, Paliouras G, Karkaletsis V, Sakkis G, Spyropoulos C, Stamatopoulos P (2000b) Learning to filter spam e-mail: a comparison of a Naive Bayesian and a memory-based approach. In: Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases. Lyon, France, pp 1–13
- Androutsopoulos I, Paliouras G, Michelakis E (2004) Learning to filter unsolicited commercial e-mail. Technical Report 2004/2, National Centre for Scientific Research “Demokritos”, Athens, Greece
- Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424View ArticleGoogle Scholar
- Barron A, Rissanen J, Yu B (1998) The minimum description length principle in coding and modeling. IEEE Trans Inf Theory 44(6):2743–2760MATHMathSciNetView ArticleGoogle Scholar
- Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29(1):335–455Google Scholar
- Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6:1579–1619MATHMathSciNetGoogle Scholar
- Bratko A, Cormack G, Filipic B, Lynam T, Zupan B (2006) Spam filtering using statistical data compression models. J Mach Learn Res 7:2673–2698MATHMathSciNetGoogle Scholar
- Carreras X, Marquez L (2001) Boosting trees for anti-spam email filtering. In: Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing. Tzigov Chark, Bulgaria, pp 58–64
- Cohen W (1995) Fast effective rule induction. In: Proceedings of 12th International Conference on Machine Learning. Tahoe City, CA, USA, pp 115–123
- Cohen W (1996) Learning rules that classify e-mail. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access. CA, USA, Stanford, pp 18–25
- Cormack G (2008) Email spam filtering: a systematic review. Found Trends Inf Retr 1(4):335–455MathSciNetView ArticleGoogle Scholar
- Cormack G, Lynam T (2007) Online supervised spam filter evaluation. ACM Trans Inf Syst 25(3):1–11View ArticleGoogle Scholar
- Czarnowski I (2011) Cluster-based instance selection for machine classification. Knowl Inf Syst
- Drucker H, Wu D, Vapnik V (1999) Support vector machines for spam categorization. IEEE Trans Neural Netw 10(5):1048–1054View ArticleGoogle Scholar
- Forman G, Scholz M, Rajaram S (2009) Feature shaping for linear SVM classifiers. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. France, Paris, pp 299–308
- Frank E, Chui C, Witten I (2000) Text categorization using compression models. In: Proceedings of the 10th Data Compression Conference. Snowbird, UT, USA, pp 555–565
- Grünwald P (2005) A tutorial introduction to the minimum description length principle. In: Grünwald P, Myung I, Pitt M (eds) Advances in minimum description length: theory and applications. MIT Press, Cambridge, pp 3–81
- Guzella T, Caminhas W (2009) A review of machine learning approaches to spam filtering. Expert Syst Appl 36(7):10206–10222View ArticleGoogle Scholar
- Hidalgo J (2002) Evaluating cost-sensitive unsolicited bulk mail categorization. In: Proceedings of the 17th ACM Symposium on Applied Computing. Madrid, Spain, pp 615–620
- Joachims T (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of 14th International Conference on Machine Learning. Nashville, TN, USA, pp 143–151
- John G, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the 11th International Conference on Uncertainty in Artificial Intelligence. Montreal, Canada, pp 338–345
- Katakis I, Tsoumakas G, Vlahavas I (2009) Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowl Inf Syst 22(3):371–391View ArticleGoogle Scholar
- Kolcz A, Alspector J (2001) SVM-based filtering of e-mail spam with content-specific misclassification costs. In: Proceedings of the 1st International Conference on Data Mining. San Jose, CA, USA, pp 1–14
- Losada D, Azzopardi L (2008) Assessing multivariate Bernoulli models for information retrieval. ACM Trans Inf Syst 26(3):1–46View ArticleGoogle Scholar
- Matthews B (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 405(2):442–451View ArticleGoogle Scholar
- McCallum A, Nigam K (1998) A comparison of event models for Naive Bayes text classication. In: Proceedings of the 15th AAAI Workshop on Learning for Text Categorization. Menlo Park, CA, USA, pp 41–48
- Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with Naive Bayes—which Naive Bayes? In: Proceedings of the 3rd International Conference on Email and Anti-Spam. Mountain View, CA, USA, pp 1–5
- Peng T, Zuo W, He F (2008) SVM based adaptive learning method for text classification from positive and unlabeled documents. Knowl Inf Syst 16(3):281–301View ArticleGoogle Scholar
- Reddy C, Park J-H (2010) Multi-resolution boosting for classification and regression problems. Knowl Inf Syst
- Rissanen J (1978) Modeling by shortest data description. Automatica 14:465–471MATHView ArticleGoogle Scholar
- Sahami M, Dumais S, Hecherman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In: Proceedings of the 15th National Conference on Artificial Intelligence. Madison, WI, USA, pp 55–62
- Schapire R, Singer Y, Singhal A (1998) Boosting and Rocchio applied to text filtering. In: Proceedings of the 21st Annual International Conference on Information Retrieval. Melbourne, Australia, pp 215–223
- Schneider K (2004) On word frequency information and negative evidence in Naive Bayes text classification. In: Proceedings of the 4th International Conference on Advances in Natural Language Processing. Alicante, Spain, pp 474–485
- Siefkes C, Assis F, Chhabra S, Yerazunis W (2004) Combining winnow and orthogonal sparse bigrams for incremental spam filtering. In: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases. Pisa, Italy, pp 410–421
- Song Y, Kolcz A, Gilez C (2009) Better Naive Bayes classification for high-precision spam detection. Softw Pract Experience 39(11):1003–1024View ArticleGoogle Scholar
- Teahan W, Harper D (2001) Using compression-based language models for text categorization. In: Proceedings of the 2001 Workshop on Language Modeling and Information Retrieval. Pittsburgh, PA, USA, pp 1–5
- Wozniak M (2010) A hybrid decision tree training method using data streams. Knowl Inf Syst
- Wu X, Kumar V, Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan G, Ng A, Liu B, Yu P, Zhou Z, Steinbach M, Hand D, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37Google Scholar
- Zhang J, Kang D, Silvescu A, Honavar V (2006) Learning accurate and concise Naive Bayes classifiers from attribute value taxonomies and data. Knowl Inf Syst 9(2):157–179View ArticleGoogle Scholar
- Zhang L, Zhu J, Yao T (2004) An evaluation of statistical spam filtering techniques. ACM Trans Asian Lang Inf Process 3(4):243–269View ArticleGoogle Scholar