N-Gram and KLD Based Efficient Feature Selection Approach for Text Categorization
Keywords:
Dimensionality Reduction, Feature Selection, Filter, Information Gain, Jeffreys Divergence, Kullback-Leibler Divergence, Maximum Discrimination, Text Categorization, Wrapper

Abstract
Automated categorization of text into a set of predetermined categories has become an important approach for handling and organizing the enormous volume of web documents. Text categorization is used in a wide variety of applications, such as news article categorization and spam filtering. A major challenge in text categorization is the high dimensionality of the feature space, which makes feature extraction and feature selection critically important. In this paper, TF N-gram based and unigram TF-IDF based methods are used for feature extraction. Feature selection is then performed on the extracted feature set using the Kullback-Leibler divergence (KLD) measure. We evaluated the proposed approach on the BBC document collection, a news article dataset originating from BBC News, using the Naive Bayes classification algorithm. The proposed method yields improved text categorization accuracy.
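For reference, the Kullback-Leibler divergence between two discrete distributions P and Q is

    D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)},

and the Jeffreys divergence named in the keywords is its symmetrized form, J(P, Q) = D_{KL}(P || Q) + D_{KL}(Q || P). The sketch below illustrates one plausible way to turn KLD into a term-scoring criterion for feature selection: each term t is scored by its contribution to the divergence between the category-conditional distribution P(t|c) and the corpus-wide distribution P(t). The abstract does not specify this exact formulation, so the scoring rule, the add-one smoothing, and all names in the code are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch of KLD-based term scoring for feature selection.
# The scoring rule sum_c P(t|c) * log(P(t|c) / P(t)) and the add-one
# smoothing are assumptions, not the paper's exact formulation.
import math
from collections import Counter

def kld_term_scores(docs_by_category):
    """Score each term by its contribution to the divergence between
    category-conditional and corpus-wide term distributions.

    docs_by_category maps a category label to a list of tokenized
    documents. Add-one smoothing keeps every probability positive.
    """
    per_cat = {c: Counter() for c in docs_by_category}
    corpus = Counter()
    for c, docs in docs_by_category.items():
        for doc in docs:
            per_cat[c].update(doc)
            corpus.update(doc)
    vocab = sorted(corpus)
    v = len(vocab)
    corpus_total = sum(corpus.values()) + v
    cat_totals = {c: sum(cnt.values()) + v for c, cnt in per_cat.items()}
    scores = {}
    for t in vocab:
        p_t = (corpus[t] + 1) / corpus_total          # P(t), smoothed
        score = 0.0
        for c, cnt in per_cat.items():
            p_tc = (cnt[t] + 1) / cat_totals[c]       # P(t|c), smoothed
            score += p_tc * math.log(p_tc / p_t)
        scores[t] = score
    return scores

# Toy usage: keep the highest-scoring terms as the reduced feature set.
docs = {
    "sport": [["match", "goal", "team"], ["team", "win", "goal"]],
    "tech": [["software", "release", "update"], ["software", "team"]],
}
scores = kld_term_scores(docs)
top_terms = sorted(scores, key=scores.get, reverse=True)[:4]
print(top_terms)
```

In a full pipeline, the top-k scored terms would form the reduced feature space on which the Naive Bayes classifier is trained and evaluated.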