N-Gram and KLD Based Efficient Feature Selection Approach for Text Categorization

Authors

  • Smita Shedbale ME Student, Department of Computer Engineering, DYPCOE, Akurdi, SPPU, Pune, India
  • Dr. Kailash Shaw Associate Professor, Department of Computer Engineering, DPCOE, Akurdi, SPPU, Pune, India

Keywords

Dimensionality Reduction, Feature Selection, Filter, Information Gain, Jeffreys Divergence, Kullback-Leibler Divergence, Maximum Discrimination, Text Categorization, Wrapper

Abstract

Automated categorization of text into a set of predetermined categories has become one of the most important
approaches for handling and organizing the enormous volume of web documents. Text categorization is used in a wide
variety of applications such as news article categorization and spam filtering. Feature extraction and feature
selection play the most important role in addressing a major challenge in text categorization: the high
dimensionality of the feature space. In this paper, TF N-gram based and unigram TF-IDF based methods are used for
feature extraction. From the extracted feature set, feature selection is then performed using the Kullback-Leibler
divergence (KLD) measure. We evaluated the proposed approach on the BBC document collection, a news article dataset
originating from BBC News, using the Naive Bayes classification algorithm. The proposed method yields improved text
categorization accuracy.
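
For reference, the Kullback-Leibler divergence between two discrete distributions P and Q is KL(P||Q) = sum_i P(i) log(P(i)/Q(i)). Below is a minimal sketch of KLD-based feature scoring in the spirit of the abstract, assuming a simple formulation in which each term t is scored by the divergence between the class distribution given the term, P(c|t), and the class prior P(c). The function name `kld_feature_scores` and this exact scoring formulation are illustrative assumptions, not the authors' published method.

```python
import math
from collections import Counter

def kld_feature_scores(docs, labels, eps=1e-10):
    """Score each term by KL(P(c|t) || P(c)): the divergence between
    the class distribution given the term and the class prior.
    Higher-scoring terms discriminate better between categories.

    docs   : list of token lists, one per document
    labels : list of class labels, parallel to docs

    NOTE: a hedged sketch, not the paper's exact formulation.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / n_docs for c in classes}

    # Per-class document frequency of each term.
    df = {}
    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            df.setdefault(t, Counter())[c] += 1

    scores = {}
    for t, counts in df.items():
        total = sum(counts.values())
        score = 0.0
        for c in classes:
            p_c_given_t = counts[c] / total
            if p_c_given_t > 0:
                # KL summand; eps guards against division by zero.
                score += p_c_given_t * math.log(p_c_given_t / (prior[c] + eps))
        scores[t] = score
    return scores

# Toy usage: rank terms and keep the most discriminative ones.
docs = [["goal", "match", "team"], ["election", "vote"], ["match", "team", "win"]]
labels = ["sport", "politics", "sport"]
scores = kld_feature_scores(docs, labels)
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_terms)
```

In this sketch, "election" and "vote" score highest because they occur only in the minority class, which matches the intuition that KLD-style selection favors terms whose presence strongly shifts the class distribution away from the prior.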

Published

2017-06-25

How to Cite

Smita Shedbale, & Dr. Kailash Shaw. (2017). N-Gram and KLD Based Efficient Feature Selection Approach for Text Categorization. International Journal of Advance Engineering and Research Development (IJAERD), 4(6), 660–668. Retrieved from https://www.ijaerd.org/index.php/IJAERD/article/view/3038
