N-Gram and KLD Based Efficient Feature Selection Approach for Text Categorization
Keywords:
Dimensionality Reduction, Feature Selection, Filter, Information Gain, Jeffreys Divergence, Kullback-Leibler Divergence, Maximum Discrimination, Text Categorization, Wrapper

Abstract
Automated categorization of text into a set of predetermined categories has become an important approach for handling and organizing the enormous volume of web documents. Text categorization is used in a wide variety of applications, such as news article categorization and spam filtering. A major challenge in text categorization is the high dimensionality of the feature space, which makes feature extraction and feature selection critically important. In this paper, TF N-gram based and unigram TF-IDF based methods are used for feature extraction. Feature selection is then performed on the extracted feature set using the Kullback-Leibler divergence (KLD) measure. We evaluated the proposed approach on the BBC document collection, a news article dataset originating from BBC News, using the Naive Bayes classification algorithm. The proposed method yields improved text categorization accuracy.
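For reference, the Kullback-Leibler divergence between two discrete distributions P and Q is

    D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)},

and the Jeffreys divergence named in the keywords is its symmetrized form, J(P, Q) = D_{KL}(P || Q) + D_{KL}(Q || P). The sketch below illustrates one plausible way to turn KLD into a term-scoring criterion for feature selection: each term t is scored by its contribution to the divergence between the category-conditional distribution P(t|c) and the corpus-wide distribution P(t). The abstract does not specify this exact formulation, so the scoring rule, the add-one smoothing, and all names in the code are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch of KLD-based term scoring for feature selection.
# The scoring rule sum_c P(t|c) * log(P(t|c) / P(t)) and the add-one
# smoothing are assumptions, not the paper's exact formulation.
import math
from collections import Counter

def kld_term_scores(docs_by_category):
    """Score each term by its contribution to the divergence between
    category-conditional and corpus-wide term distributions.

    docs_by_category maps a category label to a list of tokenized
    documents. Add-one smoothing keeps every probability positive.
    """
    per_cat = {c: Counter() for c in docs_by_category}
    corpus = Counter()
    for c, docs in docs_by_category.items():
        for doc in docs:
            per_cat[c].update(doc)
            corpus.update(doc)
    vocab = sorted(corpus)
    v = len(vocab)
    corpus_total = sum(corpus.values()) + v
    cat_totals = {c: sum(cnt.values()) + v for c, cnt in per_cat.items()}
    scores = {}
    for t in vocab:
        p_t = (corpus[t] + 1) / corpus_total          # P(t), smoothed
        score = 0.0
        for c, cnt in per_cat.items():
            p_tc = (cnt[t] + 1) / cat_totals[c]       # P(t|c), smoothed
            score += p_tc * math.log(p_tc / p_t)
        scores[t] = score
    return scores

# Toy usage: keep the highest-scoring terms as the reduced feature set.
docs = {
    "sport": [["match", "goal", "team"], ["team", "win", "goal"]],
    "tech": [["software", "release", "update"], ["software", "team"]],
}
scores = kld_term_scores(docs)
top_terms = sorted(scores, key=scores.get, reverse=True)[:4]
print(top_terms)
```

In a full pipeline, the top-k scored terms would form the reduced feature space on which the Naive Bayes classifier is trained and evaluated.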