Machine Learning in Malware Detection

1.0 Background Research

The idea of self-replicating programs is usually traced to John von Neumann's 1949 work on self-reproducing automata, and the amount of malware has grown steadily ever since. Antivirus companies are constantly looking for the most effective method of detecting malware. One of the best-known methods they use is signature-based detection. Over the years, however, the volume of malware has grown so quickly that signature-based detection has proven ineffective against it. In this research, I take a different approach and apply machine learning to malware detection. Using the dataset from the Microsoft Malware Classification Challenge (BIG 2015), I will look for an algorithm that detects malware effectively with a low false positive rate.

1.1 Problem Statement

As technology advances, the amount of malware increases day by day. Malware is now designed to mutate, which causes enormous growth in the number of malware variants (Ahmadi, M. et al., 2016). In addition, automated malware generation tools allow even novice malware authors to produce new variants easily (Lanzi, A. et al., 2010). Given this growth, traditional signature-based malware detection has proven ineffective against the vast number of variants (Feng, Z. et al., 2015). Machine learning methods, on the other hand, have proven effective against new malware. However, they also suffer from a high false positive rate (Feng, Z. et al., 2015).

1.2 Objective

To investigate how to apply machine learning to malware detection in order to detect unknown malware. To develop malware detection software that uses machine learning to detect unknown malware. To validate that machine-learning-based malware detection can achieve a high accuracy rate with a low false positive rate.

1.3 Theoretical / Conceptual Framework

1.4 Significance

Machine-learning-based malware detection with high accuracy and a low false positive rate will free end users from the fear of malware damaging their computers. Organizations, in turn, will have more secure systems and files.

2.0 Literature Review

2.1 Overview

Traditional security products use virus scanners to detect malicious code, and these scanners rely on signatures created by reverse engineering malware. Because malware has become polymorphic or metamorphic, the traditional signature-based detection used by antivirus products is no longer effective (Willems, G., Holz, T. & Freiling, F., 2007). Current anti-malware products carry out two main tasks in the malware analysis process: malware detection and malware classification. This paper focuses on malware detection, whose main objective is to detect malware in the system. There are two types of analysis for malware detection: dynamic analysis and static analysis.

For effective and efficient detection, feature extraction is recommended (Ahmadi, M. et al., 2016). Of the various detection methods, the one used here works on the hex and assembly files of the malware. Features are extracted from both the hex view and the assembly view of each malware file. Once the features have been extracted by category, all categories are combined into a single feature vector for the classifier (Ahmadi, M. et al., 2016). For feature selection, the binary file can be separated into blocks whose similarity is compared across malware binaries. This reduces the analysis overhead and speeds up the process (Kim, T.G., Kang, B. & Im, E.G., 2013).

To build a learning model, the extracted features and their labels are passed to a classification method such as Random Forest, Neural Network, or k-Nearest Neighbors (KNN), among many others. Support Vector Machine (SVM) is recommended when the extracted features and labels contain noise (Stewin, P. & Bystrov, I., 2016).
To generate results, the learning model is tested on a labeled dataset to produce a graph showing the detection rate and the false positive rate. To find the best result, the process is repeated with other classifiers, each producing a learning model that is tested on the same dataset. The best result is the one with the highest detection rate and the lowest false positive rate (Lanzi, A. et al., 2010).
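
As a minimal sketch of how the two evaluation figures could be computed, the example below counts true and false positives from labeled test data. The function name, the 1 = malware / 0 = benign label encoding, and the sample values are assumptions made for illustration only.

```python
def detection_and_false_positive_rate(y_true, y_pred):
    """Return (detection rate, false positive rate) for binary labels.

    Assumes 1 means malware and 0 means benign.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_positive_rate


# Hypothetical ground truth and predictions for eight samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
dr, fpr = detection_and_false_positive_rate(y_true, y_pred)
print(dr, fpr)
```

Running the same function on the predictions of each candidate classifier gives directly comparable figures for the same test set.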

2.2 Dynamic and Static Analysis

Dynamic analysis runs the malware in a simulated environment, usually a sandbox, where it is executed and its behavior is observed. There are two approaches to dynamic analysis: comparing an image of the system before and after the malware is executed, and monitoring the malware's actions during execution with the help of a debugger. The first approach usually yields a report comparable to what could be obtained by observing the binary alone, while the second is more difficult to implement but gives a more detailed report on the behavior of the malware (Willems, G., Holz, T. & Freiling, F., 2007).
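
The before-and-after comparison can be illustrated with a small sketch. It assumes each system image has already been reduced to a mapping from file path to content hash. The function name, paths, and hash values are hypothetical, and a real sandbox would also track registry entries, processes, and network activity.

```python
def diff_snapshots(before, after):
    """Compare two {path: content_hash} snapshots of a system."""
    created = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    # A file counts as modified when it exists in both snapshots
    # but its content hash differs.
    modified = sorted(
        path for path in set(before) & set(after)
        if before[path] != after[path]
    )
    return {"created": created, "deleted": deleted, "modified": modified}


# Hypothetical snapshots taken before and after running a sample.
before = {"C:/Windows/hosts": "a1", "C:/Users/u/doc.txt": "b2"}
after = {"C:/Windows/hosts": "ff", "C:/Users/u/dropper.exe": "c3"}
report = diff_snapshots(before, after)
print(report)
```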

Static analysis studies the malware without executing it, which makes this method safer than dynamic analysis. With this method, we disassemble the malware executable into an assembly file and a hex file. We then study the opcodes in both files and compare them with a pre-generated opcode profile in order to find malicious code within the executable (Santos, I. et al., 2013).
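
One way to picture the comparison against an opcode profile is sketched below. This is only an illustration: the use of cosine similarity, the function names, and the profile values are my assumptions and are not taken from Santos et al.

```python
from collections import Counter
import math


def opcode_frequencies(opcodes):
    """Turn a list of opcode mnemonics into relative frequencies."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()} if total else {}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {opcode: weight} vectors."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical opcode sequence from a disassembled sample.
sample = opcode_frequencies(["push", "mov", "call", "mov", "xor", "jmp"])
# Hypothetical pre-generated profile of a known malware family.
profile = {"mov": 0.4, "push": 0.2, "call": 0.2, "xor": 0.1, "jmp": 0.1}
score = cosine_similarity(sample, profile)
print(round(score, 3))
```

A score close to 1 means the sample's opcode distribution resembles the profile. Any threshold for flagging a file would have to be chosen experimentally.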

All malware detection relies on either static analysis or dynamic analysis. In this paper, we focus on static analysis (Ahmadi, M. et al., 2016). Dynamic analysis has a drawback: it can analyze only one malware sample at a time, so the whole process takes a long time when many samples need to be analyzed (Willems, G., Holz, T. & Freiling, F., 2007). Static analysis is mainly used to analyze hex code files and assembly code files. Compared with dynamic analysis, it takes much less time and is more convenient, because it can be scheduled to scan all files at once, even offline (Tabish, S.M., Shafiq, M.Z. & Farooq, M., 2009).

2.3 Features Extraction

For effective and efficient classification, it is wise to extract features from both the hex view file and the assembly view file, so that the two views provide complementary data (Ahmadi, M. et al., 2016).

Several types of features are extracted from the hex view file and the assembly view file: N-gram, Entropy, Image Representation, String Length, Symbol, Operation Code, Register, Application Programming Interface, Section, Data Define, and Miscellaneous (Ahmadi, M. et al., 2016).

The N-gram feature is commonly used to classify sequences in many different areas. During feature extraction, N-grams can capture the sequence of the malware's execution (Ahmadi, M. et al., 2016).

The Entropy feature measures the uncertainty in a series of bytes in the malware executable. This uncertainty depends on the amount of information the file contains (Lyda, R. & Hamrock, J., 2007).

For the Image Representation feature, the malware binary is read as a vector of 8-bit values and then organized into a 2D array. The 2D array can be visualized as a grayscale image in which each pixel represents a byte of the file. This feature looks for common byte arrangements in malware binaries (Nataraj, L. et al., 2011).

For the String Length feature, we open each malware executable in hex view and extract all ASCII strings from it. Because it is difficult to extract only genuine strings without also picking up non-useful elements, the important strings must be selected from those extracted (Ahmadi, M. et al., 2016).

For the Operation Code feature, an operation code (opcode) is the part of a machine-language instruction that specifies the operation to perform. In malware detection, the opcodes and their frequencies are extracted and compared with those of non-malicious software, since distinct sets of opcodes are characteristic of malware and of non-malware (Bilar, D., n.d.).

For the Register feature, the frequency of register use can help classify malware, because register renaming is one of the techniques used to make malware harder to analyze and detect (Christodorescu, M., Song, D. & Bryant, R.E., 2005).
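
The byte N-gram and Entropy features can be sketched as follows. This is a simplified illustration, not the extraction code of Ahmadi et al. It assumes a hex view in which each line starts with an address followed by two-digit hex bytes, with "??" marking unknown bytes. The function names are hypothetical, and skipping "??" makes the bytes on either side of it adjacent, which a careful implementation might want to avoid.

```python
from collections import Counter
import math


def parse_hex_view(lines):
    """Collect byte values from lines like '00401000 56 8D 44 ?? 08'."""
    data = []
    for line in lines:
        tokens = line.split()
        for token in tokens[1:]:      # tokens[0] is the address column
            if token == "??":         # unknown byte: skipped (simplification)
                continue
            data.append(int(token, 16))
    return data


def byte_ngrams(data, n):
    """Count every run of n consecutive bytes."""
    return Counter(tuple(data[i:i + n]) for i in range(len(data) - n + 1))


def shannon_entropy(data):
    """Shannon entropy of the byte values, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical hex-view excerpt.
hex_lines = ["00401000 56 8D 44 ?? 08", "00401010 56 8D"]
data = parse_hex_view(hex_lines)
bigrams = byte_ngrams(data, 2)
print(bigrams.most_common(1), shannon_entropy(data))
```

In practice the entropy is often computed over a sliding window rather than the whole file, so that packed or encrypted regions stand out.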
For the Application Programming Interface feature, API calls are code that invokes functions provided by other software, which in our case is the Windows API. Malicious and non-malicious software use a large number of different API calls, which makes them hard to tell apart. For this reason, we focus on the API calls used most frequently in malware binaries in order to improve the result (Top maliciously used apis, 2017).

For the Data Define feature, not all malware contains API calls. Samples without API calls consist mainly of data-definition directives, usually db, dw, and dd, and there is a set of data-define features (DP) that can characterize such malware (Ahmadi, M. et al., 2016).

For the Miscellaneous feature, we choose a few keywords that most malware have in common from the disassembled malware files (Ahmadi, M. et al., 2016).
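
A rough sketch of the API and Data Define features is given below. It assumes the assembly view has already been tokenized into (mnemonic, operand) pairs. The watch list of API names is a hypothetical example and is not taken from the cited list, and the function name is my own.

```python
DATA_DEFINE = {"db", "dw", "dd"}

# Hypothetical watch list of Windows API names, for illustration only.
WATCHED_APIS = ("CreateFileA", "WriteFile", "VirtualAlloc", "LoadLibraryA")


def api_and_data_define_features(instructions):
    """Return (API call counts, proportion of data-define lines)."""
    api_counts = {name: 0 for name in WATCHED_APIS}
    data_define = 0
    for mnemonic, operand in instructions:
        if mnemonic in DATA_DEFINE:
            data_define += 1
        elif mnemonic == "call":
            # Simple substring match of the API name in the call operand.
            for name in WATCHED_APIS:
                if name in operand:
                    api_counts[name] += 1
    total = len(instructions)
    ratio = data_define / total if total else 0.0
    return api_counts, ratio


# Hypothetical tokenized assembly lines.
instructions = [
    ("call", "ds:CreateFileA"),
    ("db", "0"),
    ("dd", "offset unk_401000"),
    ("mov", "eax, ebx"),
    ("call", "ds:WriteFile"),
    ("call", "ds:CreateFileA"),
]
api_counts, dd_ratio = api_and_data_define_features(instructions)
print(api_counts, dd_ratio)
```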

Among all these features, the most appropriate for our research are N-gram and Opcode, because these two features have been shown to give the highest accuracy with low logloss. Both appear frequently in malware files, and well-known feature sets for malware already exist for them. Their drawback is that they require a lot of resources and time to process (Ahmadi, M. et al., 2016). We will also try other features and compare them with N-gram and Opcode to verify the result.
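
For reference, logloss over N samples is the average of -log(p), where p is the probability the classifier assigned to the correct class of each sample. Lower values are better. The sketch below assumes labels are integer class indices and each prediction is a list of class probabilities. The function name and the clipping constant are my own choices.

```python
import math


def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        # Clip to avoid log(0) when a classifier outputs exactly 0 or 1.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


# Hypothetical predictions for three samples over three malware classes.
y_true = [0, 2, 1]
probabilities = [
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
]
loss = multiclass_log_loss(y_true, probabilities)
print(round(loss, 4))
```

A classifier that is confidently wrong is penalized heavily by this measure, which is why it complements plain accuracy.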

2.4 Classification

In this section, we do not review the algorithms or mathematical formulas behind each classifier, but rather the properties that give a classifier an advantage under certain conditions when classifying malware features. The classifiers we review are Nearest Neighbor, Naïve Bayes, Decision Tree, Support Vector Machine, and XGBoost [21] (Kotsiantis, S.B., 2007) (Ahmadi, M. et al., 2016).

Because we need a classifier to train on the malware features, we review the candidates in order to choose the one most likely to give the best result. The Nearest Neighbor classifier is one of the simplest classification methods and is normally used in case-based reasoning [21]. Naïve Bayes usually produces a simple, constrained model and is not suitable for irregular input data, which makes it unsuitable for malware classification, where the data are not regular (Kotsiantis, S.B., 2007). A Decision Tree classifies samples by sorting them through tree nodes based on their feature values, with each branch representing a value the node can take. Because the tree makes a true-or-false decision at each node based on that value, it has difficulty dealing with unknown features that are not stored in the tree nodes (Kotsiantis, S.B., 2007). A Support Vector Machine has a model complexity that allows it to handle a large number of features and still obtain good results, which makes it suitable for malware classification, since malware contains a large number of features (Kotsiantis, S.B., 2007). XGBoost is a scalable tree boosting system that has won many machine learning competitions by achieving state-of-the-art results. Its advantages are that it suits most scenarios and runs faster than most other classification techniques (Chen, T., n.d.).
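
To make the simplest of these classifiers concrete, the sketch below implements a basic k-nearest-neighbor vote over feature vectors. It is a toy illustration with hypothetical two-dimensional features and labels, using Euclidean distance as an assumed metric. It is not the model used in this research.

```python
from collections import Counter
import math


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the label of query by majority vote of its k nearest neighbors."""
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors, e.g. two normalized opcode frequencies.
train_vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
train_labels = ["malware", "malware", "benign", "benign"]

print(knn_predict(train_vectors, train_labels, (0.85, 0.15)))
print(knn_predict(train_vectors, train_labels, (0.15, 0.85)))
```

Because every prediction compares the query with all stored samples, this approach becomes slow on large datasets. That is one reason to consider the other classifiers reviewed above.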
