A STRUCTURAL APPROACH FOR PDF DOCUMENTS CLASSIFICATION
Keywords:
PDF, Machine Learning, Malware, Objects, Streams, JavaScript
Abstract
Over the last few years, the PDF document has proved to be a highly effective vector for malicious infections,
accounting for up to 80% of all the exploits found by Cisco ScanSafe. Generating novel PDF files is easy, and the
number of PDF documents identified as malicious has grown beyond the capacity of security analysts to analyse
them manually. The solution proposed in this paper is to automatically extract features from PDF files, analyse the
malicious and non-malicious behaviour of the PDFs, and group and classify them, so that similar malware can
be detected without manual analysis, reducing the workload of malware experts. The features discussed
may also be studied to identify trends within PDF files, such as various exploits or obfuscation techniques.
Finding homogeneity in PDF files exposes further information about the data set used. In our study we collected a dataset
from various sources and tested the files for maliciousness in order to classify them. We also report the performance of
different classifiers. Finally, we believe that more cautious machine-learning-based detectors are needed to reduce
targeted attacks.