Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers

Abstract

E-mail spam has become an increasingly important problem with a substantial economic impact on society. Fortunately, several approaches can automatically detect and remove most of these messages, and the best-known techniques are based on Bayesian decision theory. However, such probabilistic approaches often suffer from a well-known difficulty: the high dimensionality of the feature space. Many term-selection methods have been proposed to avoid the curse of dimensionality. Nevertheless, it is still unclear how the performance of Naive Bayes spam filters depends on the scheme used to reduce the dimensionality of the feature space. In this paper, we study the performance of many term-selection techniques with several different models of Naive Bayes spam filters. Our experiments were carefully designed to ensure statistically sound results. Moreover, we analyze the measures usually employed to evaluate the quality of spam filters. Finally, we also investigate the benefits of using the Matthews correlation coefficient as a measure of performance.
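The abstract does not state the formula for the Matthews correlation coefficient (MCC), so the following is only a minimal sketch based on its standard definition over confusion-matrix counts, with spam treated as the positive class. The function names, the example counts, and the convention of returning 0.0 when the denominator vanishes are assumptions made for illustration, not details taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC from confusion-matrix counts (spam = positive class).

    Returns 0.0 when the denominator is zero; this is a common convention
    and an assumption here, not something specified in the abstract.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of messages classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical corpus: 900 legitimate messages and 100 spam messages.
# A trivial filter that lets every message through.
trivial = dict(tp=0, tn=900, fp=0, fn=100)
# A filter that catches 80 spams and blocks 20 legitimate messages.
useful = dict(tp=80, tn=880, fp=20, fn=20)

for name, counts in (("trivial", trivial), ("useful", useful)):
    print(name, accuracy(**counts), round(matthews_corrcoef(**counts), 3))
```

With these made-up counts, the trivial filter reaches an accuracy of 0.9 while its MCC is 0.0, which suggests why a single accuracy figure can be misleading on imbalanced corpora and why a correlation-based measure may be a useful complement.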

Author information

Correspondence to Tiago A. Almeida.

About this article

Cite this article

Almeida, T.A., Almeida, J. & Yamakami, A. Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers. J Internet Serv Appl 1, 183–200 (2011). https://doi.org/10.1007/s13174-010-0014-7

Keywords

  • Dimensionality reduction
  • Spam filter
  • Text categorization
  • Classification
  • Machine learning