Machine Learning and Images for Malware Detection and Classification
School of Science and Technology, MSc in Communications and Cybersecurity
Neural networks have become increasingly popular in recent years and are both the most discussed and the least well-understood branch of machine learning.
Aside from popularized applications to context awareness, they have shown good experimental results in malware/anomaly detection, APT protection, and spam/phishing detection.
Detecting malicious code by exact matching against collected datasets is becoming a large-scale identification problem that new malware variants make harder by the day.
The aim was three-fold: (1) to reimplement, evaluate and benchmark the existing literature, (2) to design and implement a comparison framework capturing the differentiating criteria, and (3) to report on the outcomes of the testing process. More precisely, the aim is to build a classifier tool that classifies malware samples automatically, retains something like a memory of what it has learned, and can organize labels (classes) that it has not yet processed or learned. This calls for algorithms that can learn and can retain the results of previous tests and experiments.
Training first requires knowing the correct classes (called labels) of the samples in the training set. Algorithms are tested and compared on a test set for which the labels are known. Many algorithms also use a validation set (usually a held-out part of the labeled training set) to manage their learning process. The expected outcome is that the malware images are classified as in the original dataset, that is, arranged in the same subfolders under the same labels, or as close to that arrangement as possible; this dataset was chosen for the dissertation because its known labels make it possible to identify the best-classified outcome. The goal of the clustering algorithms is to examine whether the same clusters as in the original dataset can be recreated. The experiments therefore try to re-form the original groups of malware samples after the data has been shuffled, using whichever algorithm is under test.
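The cluster-recovery check described above can be made concrete with a pair-counting agreement score. The following is a minimal sketch, assuming the Rand index as the measure (the abstract does not say which measure was used); the function name `rand_index` and the toy labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def rand_index(true_labels, cluster_ids):
    """Return the fraction of sample pairs on which two groupings agree.

    A pair agrees when both groupings put the two samples together,
    or both groupings keep them apart. Cluster ids are arbitrary, so
    only co-membership is compared, never the id values themselves.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(true_labels)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agree = 0
    for i, j in pairs:
        same_true = true_labels[i] == true_labels[j]
        same_pred = cluster_ids[i] == cluster_ids[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


# Hypothetical ground-truth families and two hypothetical clusterings.
families = ["family_a", "family_a", "family_b", "family_b", "family_c"]
recovered = [2, 2, 0, 0, 1]   # same groups, different ids
merged = [0, 0, 0, 0, 1]      # two families collapsed into one cluster

print(rand_index(families, recovered))
print(rand_index(families, merged))
```

A score of 1.0 means the original groups were recreated exactly, regardless of how the clusters are numbered; merging two families lowers the score because pairs that belong apart end up together.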
The goal in general is to extend and improve the system by:
1. Performing malware detection.
2. Performing classification of malware families.
3. Finding new and improving old features.
4. Applying a feature selection algorithm that selects the most discriminative features.
5. Building an extensive database of malware by collecting more samples.
6. Retrieving a uniform sample set among the malware classes.
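One simple way to approach the last item, a uniform sample set among the malware classes, is to downsample every class to the size of the smallest one. This is a sketch under that assumption, not the procedure used in the dissertation; `balanced_subset`, the file names and the labels are hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(samples, labels, seed=0):
    """Downsample every class to the size of the smallest class.

    Returns a list of (sample, label) pairs with the same number of
    samples for each label. The seed makes the draw repeatable.
    """
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    if not by_class:
        return []
    per_class = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_class):
        # Draw without replacement so no sample is duplicated.
        for sample in rng.sample(by_class[label], per_class):
            subset.append((sample, label))
    return subset


# Hypothetical, unbalanced sample set: 4 images of one family, 2 of another.
images = ["a1.png", "a2.png", "a3.png", "a4.png", "b1.png", "b2.png"]
classes = ["family_a"] * 4 + ["family_b"] * 2
for image, family in balanced_subset(images, classes):
    print(family, image)
```

Downsampling discards data from the larger classes; with strongly unbalanced families, collecting more samples for the small classes (item 5) may be preferable.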
Malware executables are converted to images, and each sample is characterized by an image feature descriptor. The performance obtained for malware classification and clustering is promising. The dataset used for demonstration is the Malimg dataset, from Nataraj et al., 2011, "Malware Images: Visualization and Automatic Classification". This dataset comprises 25 malware families with a varying number of variants per family.
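The conversion of an executable to an image can be sketched as follows: each byte is read as one 8-bit grayscale pixel, and the byte stream is wrapped into rows of a fixed width. The fixed default width of 256 and the zero-padding of the last row are assumptions made for this illustration, and the function names are hypothetical; the abstract does not state the exact parameters used.

```python
def bytes_to_grayscale_rows(data, width=256):
    """Interpret each byte as a grayscale pixel (0..255) and wrap into rows.

    The last row is zero-padded so that every row has `width` pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # iterating over bytes yields ints in 0..255
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))  # assumption: pad with black
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def read_executable_as_image(path, width=256):
    """Read a file in binary mode and return its grayscale pixel rows."""
    with open(path, "rb") as handle:
        return bytes_to_grayscale_rows(handle.read(), width)


# Tiny in-memory example: five bytes wrapped into rows of two pixels.
print(bytes_to_grayscale_rows(bytes([0, 255, 16, 32, 7]), width=2))
```

The resulting list of rows can be handed to any image library or feature extractor that accepts a two-dimensional array of 8-bit intensities.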
The classification algorithms implemented were Support Vector Machine, Decision Tree, Random Forest, Perceptron, Multilayer Perceptron, Stochastic Gradient Descent, Multinomial Naive Bayes, BernoulliRBM and Nearest Centroid; the clustering algorithms implemented were DBSCAN, MeanShift, and K-means with its MiniBatchKMeans variant.
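Of the classifiers listed, Nearest Centroid is the simplest to illustrate: each class is represented by the mean of its training feature vectors, and a new sample receives the label of the closest mean. The sketch below is a plain-Python illustration of that idea, not the dissertation's code and not a library implementation; the helper names and the two-dimensional feature vectors are hypothetical.

```python
import math
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Compute the mean feature vector (centroid) of every class."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for k, value in enumerate(vec):
            sums[label][k] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in class_sum]
        for label, class_sum in sums.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Hypothetical image descriptors for two malware families.
train_x = [[0.1, 0.2], [0.0, 0.1], [0.9, 0.8], [1.0, 0.9]]
train_y = ["family_a", "family_a", "family_b", "family_b"]
centroids = fit_centroids(train_x, train_y)

print(predict(centroids, [0.2, 0.1]))
print(predict(centroids, [0.8, 1.0]))
```

In practice the feature vectors would be the image descriptors extracted from the malware images, and the predictions would be compared against the known labels of a held-out test set.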