A REVIEW OF FEATURE EXTRACTION METHODS FOR TEXT CLASSIFICATION
Keywords:
Natural Language Processing, Feature Extraction, Classification, Bag of Words, TF-IDF, Word2Vec, Logistic Regression, Random Forest Classifier
Abstract
Natural Language Processing (NLP) and machine learning are widely applied in today's era of digitised data.
Over time, the value of data keeps changing, and it is important to capture that value in order to perform in-depth
research in various domains. Over the past decade, NLP has gained importance because it can reveal information that is
otherwise hidden in text. Discovering the information of interest in a huge volume of text data is difficult, so
information extraction based on computational text processing is necessary. For many information management goals,
recognising the words and phrases in free text that fall under particular classes of interest is an important first
step. It is crucial to manage the huge and rapidly growing amount of text being generated, which includes, for example,
clinical and biomedical text. Features can be extracted from documents in order to classify them. Feature extraction
derives an important subset of features from the data to improve the classification task. Correctly identifying the
relevant features in a text is therefore important, and applying and extending NLP techniques can help to better
understand and study the data. This paper analyses the clinical literature on cancer. Feature extraction methods such
as bag of words, TF-IDF, and Word2Vec are compared for clinical text analysis. The extracted features are evaluated
using Logistic Regression and Random Forest classifiers.
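
As a minimal illustration of two of the feature extraction methods named above, the following sketch builds bag-of-words and TF-IDF vectors in plain Python. The toy sentences, the whitespace tokeniser, the helper names (`tokenize`, `bag_of_words`, `idf`, `tf_idf`), and the smoothed IDF formula are all assumptions made for illustration; they are not the data, preprocessing, or weighting used in this paper. Word2Vec and the two classifiers are omitted because they need a trained model or an external library.

```python
import math
from collections import Counter

# Hypothetical toy corpus; not data from the paper.
docs = [
    "tumor cells respond to therapy",
    "therapy reduces tumor growth",
    "patients report mild fatigue",
]

def tokenize(text):
    # Naive lowercase whitespace tokeniser (an assumption for this sketch).
    return text.lower().split()

tokenized = [tokenize(d) for d in docs]

# Vocabulary: every distinct term in the corpus, in sorted order.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bag_of_words(tokens):
    # Bag of words: raw count of each vocabulary term in one document.
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def idf(term):
    # Smoothed inverse document frequency, one common variant:
    # log((1 + N) / (1 + df)) + 1, where df is the number of documents
    # that contain the term and N is the number of documents.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + len(tokenized)) / (1 + df)) + 1.0

def tf_idf(tokens):
    # TF-IDF: term count scaled by how rare the term is across the corpus.
    counts = Counter(tokens)
    return [counts[term] * idf(term) for term in vocab]

bow_matrix = [bag_of_words(doc) for doc in tokenized]
tfidf_matrix = [tf_idf(doc) for doc in tokenized]
```

In this sketch, a term that occurs in several documents (such as "tumor") receives a lower IDF weight than a term that occurs in only one (such as "fatigue"), which is the property that distinguishes TF-IDF from raw bag-of-words counts. Either matrix could then be passed to a classifier as its feature input.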