TELKOMNIKA Telecommunication Computing Electronics and Control

Received Aug 27, 2020; Revised Dec 04, 2021; Accepted Dec 13, 2021
Recently, ransomware attacks have posed a serious threat to a wide range of organizations and individuals, who are targeted for financial gain. There is therefore a real need for more innovative methods that can proactively detect and prevent this type of attack. Multiple approaches have been developed to detect attacks using different techniques. One of these is machine learning, which provides reasonable results in most attack detection systems. In the current article, different machine learning techniques are tested to analyze their ability to detect ransomware attacks. The top 1000 features are extracted from raw bytes with the use of gain ratio as a feature selection method. Three different classifiers (decision tree (J48), random forest, and radial basis function (RBF) network) available in the Waikato Environment for Knowledge Analysis (WEKA) machine learning tool are evaluated to achieve significant detection accuracy of ransomware. The results show that random forest gave the best detection accuracy, at around 98%.


INTRODUCTION
Ransomware is a type of malicious software that blocks users from accessing their device or personal data and demands a ransom payment to restore access. Since its first appearance in the late 1980s, ransomware has undergone serious development that has enabled attackers to move from personal blackmail to large-scale corporate blackmail. Detecting this type of attack is therefore a difficult technical problem [1]. The cost of ransomware damage for 2017 was estimated at $5 billion, and the figure for 2019 was expected to reach $11.5 billion [2]. The Herjavec Group estimated that cybercrime will cost USD 6 trillion by 2021 [3]. In addition to major financial losses, the risk of ransomware victimization has risen by 97 percent since 2017 [4], and the trend continues: it was reported that by the end of 2019 ransomware would strike a company every 14 s, dropping to every 11 s by 2021.
In the current paper, static analysis is used to detect ransomware attacks by extracting features of 32-bit size directly from binary files in the preprocessing stage. A gain ratio feature selection method is used to select the best features for distinguishing between ransomware and goodware samples. In addition, three different classification models based on supervised learning algorithms are used, namely decision tree (J48), random forest (RF), and radial basis function (RBF) network.
The classification models are trained using 50 percent of the collected ransomware and goodware files, while the other 50 percent is used for testing the models. The results reveal that the random forest classifier is more effective in terms of accuracy and time consumption than the other classification models. The remaining parts of the article are organized as follows: section 2 addresses the related work in
TELKOMNIKA Telecommun Comput El Control, Comparative analysis of various machine learning algorithms for … (Ban Mohammed Khammas)
gradient tree boosting algorithm achieved a detection rate of around 98.25%. Meanwhile, Subedi et al. [18] developed an analysis tool named crypt-ransomware-static (CRSTATIC), which creates dynamic-link libraries (DLLs) from input binary programs. A data-mining technique was used to generate association rules for these DLLs. Ferrante et al. [19] also built a hybrid system combining a static detection method and a dynamic detection method. The static approach utilized the frequency of opcodes, while the dynamic detection method utilized system call statistics, memory usage, central processing unit (CPU) usage, and network usage to detect Android ransomware. The false-positive rate attained was less than 4%.
The motivation of the current study is to analyze the ability of machine learning to detect ransomware using features extracted directly from the binary file, with the most frequent features extracted from ransomware files added to the most frequent features extracted from Snort malware signatures.

METHODOLOGY
There is a need for a new technique, usable in advanced security equipment, that is able to detect the security threat of ransomware. This article investigates the ability of machine learning techniques to detect ransomware by comparing three different classifiers using the proposed approach. The proposed approach, as shown in Figure 1, comprises three major stages. The first stage is preprocessing of the dataset, the second stage is feature selection, and the third stage uses three different classifiers to detect ransomware. In the proposed method, the features are extracted directly from binary files using static analysis, which eliminates the disassembly step needed to obtain opcode features. A preprocessing step then prepares the dataset and creates the feature vectors. This step is essential because some of the symbolic features in the raw dataset prevent the classifier from processing the data. In the preprocessing step, the symbolic features are eliminated or changed, as they do not contribute significantly to attack detection. Moreover, these features have undesirable effects such as increased training time, wasted computing resources and memory, and added complexity in the classifier's architecture [20].
The preprocessing step involves several sub-processes. First, the raw bytes in each file are divided using a fixed-size sliding window (32 bits) in order to extract the features, since dealing with bytes is more straightforward and faster than using opcode features [21]-[23]. The feature size of 32 bits has been adopted in the current study because it produces significant results in malware detection [24]-[27]. Second, the frequency of each feature in these files is counted. According to Homayoun et al. [13], there are common features in each ransomware family. Therefore, the current work focuses on selecting these important features by analyzing each ransomware file using the counting process. The third sub-process is a normalization step, which is necessary to create the feature vectors, as shown in (1).
$$NF_{i,j} = \frac{f_{i,j}}{\sum_{h} f_{h,j}} \tag{1}$$
where $NF_{i,j}$ is the normalized frequency, $\sum_{h} f_{h,j}$ is the total number of features in a file, and $f_{i,j}$ is the frequency of a specific feature.
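The three preprocessing sub-processes described above (windowing, counting, and normalization) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the window stride of one byte is an assumption (the text does not state it), and the sample bytes are a toy input rather than a real executable.

```python
from collections import Counter

WINDOW_BYTES = 4  # 32-bit feature size, as stated in the text


def extract_counts(data: bytes, stride: int = 1) -> Counter:
    """Count every 4-byte window in a byte string.

    The default stride of 1 byte is an assumption; the text does not give it.
    """
    counts = Counter()
    for start in range(0, len(data) - WINDOW_BYTES + 1, stride):
        counts[data[start:start + WINDOW_BYTES]] += 1
    return counts


def normalize(counts: Counter) -> dict:
    """Divide each feature count by the total number of features in the file."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: n / total for feature, n in counts.items()}


# Toy input (8 bytes), chosen only so that one window repeats.
sample = bytes.fromhex("4d5a90004d5a9000")
vector = normalize(extract_counts(sample))
```

Here the 8-byte sample yields five overlapping windows, one of which occurs twice, so its normalized frequency is 2/5 and all frequencies in the file sum to 1.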
The second stage of the proposed method is the feature selection process, which is considered an important part of the machine learning technique. It is generally used to improve the effectiveness of data mining algorithms and the performance of data classification [28]. The major function of feature selection is to reduce the dimensionality of the feature set by eliminating irrelevant features. In the current work, the gain ratio (GR) feature selection method is employed, and the top 1000 features are selected on this basis.
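To make the ranking step concrete, the sketch below computes the gain ratio of a feature as its information gain divided by its split information and keeps the k highest-scoring columns. It is an illustration under stated assumptions rather than the procedure used in the study: the function names are hypothetical, and normalized frequencies are reduced to present/absent purely to keep the example short.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(values, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


def top_k_features(matrix, labels, k):
    """Return the indices of the k columns with the highest gain ratio.

    Assumption: each frequency is binarized to present (> 0) or absent.
    """
    scores = []
    for j in range(len(matrix[0])):
        column = [row[j] > 0 for row in matrix]
        scores.append((gain_ratio(column, labels), j))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data: 1 = ransomware, 0 = goodware. Column 0 appears only in ransomware.
labels = [1, 1, 0, 0]
matrix = [
    [0.2, 0.1, 0.3],
    [0.1, 0.1, 0.0],
    [0.0, 0.1, 0.2],
    [0.0, 0.1, 0.0],
]
best = top_k_features(matrix, labels, 1)
```

In this toy matrix only the first column separates the two classes, so it is the one selected; in the study the same idea is applied with k = 1000.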
The third stage of the proposed approach is the classification process. Three different classifiers are examined in order to find the best classifier for the detection of ransomware. These classifiers are decision tree (J48), random forest (RF), and radial basis function (RBF) network, which have been applied using the WEKA tool (an open-source, graphical user interface (GUI) based machine learning tool). The decision tree, developed by Quinlan [29], is an algorithm that creates a hierarchical set of rules based on minimizing classification error. The random forest algorithm combines the results of many decision trees in order to identify the optimal set of rules that minimizes the classification error. It iteratively selects random subsamples of features to train multiple decision trees and then builds the classifier that makes predictions in the testing phase [30]-[32].
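The idea of training many trees on random feature subsets and combining their votes can be illustrated with a deliberately simplified sketch. It is not WEKA's random forest: each tree here is a one-level decision stump, the bootstrap sampling and the parameter values are assumptions, and the data are made up for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) pair with the fewest training errors."""
    best = None
    for j in feature_ids:
        values = sorted({row[j] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    if best is None:  # every candidate feature was constant in this sample
        majority = Counter(labels).most_common(1)[0][0]
        return (0, 0.0, majority, majority)
    return best[1:]


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]
        feature_ids = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], feature_ids))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[j] <= t else right
                    for j, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy data: columns 0 and 1 separate the classes, column 2 is noise.
rows = [
    [0.1, 0.2, 0.5], [0.2, 0.1, 0.9], [0.0, 0.3, 0.1], [0.3, 0.0, 0.7],
    [0.9, 0.8, 0.4], [0.8, 0.9, 0.8], [1.0, 0.7, 0.2], [0.7, 1.0, 0.6],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
```

A single stump trained on an unlucky sample can be wrong, but the majority vote over the ensemble is far more stable, which is the property the random forest relies on.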
The radial basis function (RBF) network is a supervised learning technique that minimizes squared error. It is a neural network with radially symmetric activation functions in the hidden layer, which means that its output depends on the distance between the input data vector and the weight vector, called the center [33]. A fitness function is used to reach the best accuracy in the radial basis function network (RBFN). Many fitness functions can be used to measure the error; the mean square error (MSE) is used in the current research. The pseudo-code for selecting the important features and the pseudo-code for comparing the machine learning models are shown in Algorithm 1 and Algorithm 2, respectively.
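The distance-based activation and the MSE fitness function can be sketched as below. The Gaussian form of the basis function, the width parameter, and all names and values are assumptions chosen for illustration; the text only states that the activation is radially symmetric around a center.

```python
import math


def rbf_activation(x, center, width):
    """Gaussian activation that depends only on the distance to the center."""
    squared_distance = sum((a - c) ** 2 for a, c in zip(x, center))
    return math.exp(-squared_distance / (2 * width ** 2))


def rbf_output(x, centers, widths, weights, bias=0.0):
    """Weighted sum of the hidden-layer activations."""
    hidden = [rbf_activation(x, c, w) for c, w in zip(centers, widths)]
    return bias + sum(wt * h for wt, h in zip(weights, hidden))


def mean_squared_error(targets, outputs):
    """Average squared difference between targets and network outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


# Two hidden units with made-up centers; the second is weighted toward class 1.
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [0.5, 0.5]
weights = [0.0, 1.0]
at_center = rbf_output([1.0, 1.0], centers, widths, weights)
far_away = rbf_output([0.0, 0.0], centers, widths, weights)
```

An input that coincides with a center gives that unit an activation of exactly 1, and the activation decays as the input moves away, so `at_center` is larger than `far_away`.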

DATASET COLLECTION
Two types of executable files are used in the present study: ransomware executable files and goodware executable files. The ransomware files are downloaded from VirusTotal [34], while the goodware files are collected from the portable apps platform [35] and the Windows platform. The total number of ransomware files is 840, from three different ransomware families (Cerber, Locky, and TeslaCrypt), similar to [36]. The collected goodware files are of almost the same size as the ransomware files and equal in number (840 files). VirusTotal.com has been used to check the goodware and ransomware files. 50% of the dataset is used in the training stage, while the remaining 50% is used in the testing stage, in order to avoid the problem of an imbalanced dataset. In the present work, two systems have been used to implement the proposed method and obtain the results. The first is Windows 10 on a Core i7 CPU with 8 cores and 16 GB of RAM. The second is Linux 4.1.
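One way to obtain a balanced 50/50 split like the one described above is to halve each class separately. The sketch below assumes a random per-class split with a fixed seed, which the text does not specify, and uses placeholder file names instead of the actual samples.

```python
import random


def split_half(files, seed=0):
    """Shuffle a list of files and cut it into two equal halves."""
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Placeholder names standing in for the 840 + 840 collected files.
ransomware = [f"ransomware_{i:03d}" for i in range(840)]
goodware = [f"goodware_{i:03d}" for i in range(840)]

r_train, r_test = split_half(ransomware, seed=1)
g_train, g_test = split_half(goodware, seed=2)

# Label 1 = ransomware, 0 = goodware.
train = [(name, 1) for name in r_train] + [(name, 0) for name in g_train]
test = [(name, 1) for name in r_test] + [(name, 0) for name in g_test]
```

Splitting per class keeps both the training and the testing sets at 420 ransomware and 420 goodware files, so neither set is imbalanced.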

EXPERIMENTAL RESULTS AND ANALYSIS
One of the challenges facing researchers in detection systems is scalability, which involves high storage requirements, longer implementation time, and complexity. To avoid these scalability effects, different numbers of attributes are tested using GR to find the size that offers the highest accuracy at a reasonable feature size. A set of 1000 attributes is found to be the best in terms of accuracy and time consumption. Figure 2 shows the simulation of the training and testing stages for the classifiers used in the proposed method.
In order to study the effectiveness of the classifiers, the false positive ratio (FPR), false negative ratio (FNR), true negative ratio (TNR), true positive ratio (TPR), and accuracy have been used in the current work [36], as follows:
$$TPR = \frac{TP}{TP + FN}, \quad FNR = \frac{FN}{TP + FN}, \quad TNR = \frac{TN}{TN + FP}, \quad FPR = \frac{FP}{TN + FP}$$
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
To measure the detection accuracy of the different classifiers, all classifier parameters are left at their default values in the experiments. The detection accuracy for different numbers of attributes (from 1000 to 7000) is shown in Figure 3, which illustrates that the best accuracy (97.73%) is obtained using RF with 1000 attributes. Figure 4 shows the time needed by the different classifiers to predict the testing dataset when the number of attributes ranges from 1000 to 7000. The results for fewer than 1000 and more than 7000 attributes are not included in the current analysis because the detection accuracy in these ranges is very low for the different classifiers. This is in line with [24], which noted that using a large number of attributes reduces the accuracy of the classifier model. Figure 4 shows that the fastest classifier in detection is J48 (0.54 s) for the different numbers of attributes, while RBF shows the longest prediction time (2.2 s). Although the RF prediction time (1.49 s) is not the lowest, its higher accuracy makes it preferable to the other classifiers. Figures 5, 6, 7, and 8 show the trends of the recall, precision, f-measure, and receiver operating characteristic (ROC), respectively, of the different classifiers for different numbers of attributes.

It can be seen that the random forest achieved the best results for all of the above measures across the different numbers of attributes, as follows: f-measure 97.8, recall 99.8, ROC 99.6, and precision 95.9. At the same time, the random forest results show that as the number of attributes increases, the values of recall, precision, f-measure, and ROC decrease. This finding shows that the number of attributes has a significant effect on classifier accuracy, because irrelevant attributes or features in the data can decrease the accuracy [24].
The FNR, FPR, and TNR are shown in Figures 9, 10, and 11, respectively. As is evident, the random forest has the highest TNR (0.957), the lowest FPR (0.043), and the lowest FNR (0.002). To compare the present work with previous research, Table 1 shows a comparison with the most closely related works. The proposed method shows an advantage over the methods of [24] and [16].
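The reported rates can be cross-checked with the standard confusion-matrix definitions. The counts used below are hypothetical: they are not given in the paper, and were chosen only because, for a test set of 420 ransomware and 420 goodware files, they reproduce the rounded random forest figures quoted above.

```python
def detection_rates(tp, fn, fp, tn):
    """Standard rates computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "TPR": recall,
        "FNR": fn / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (tn + fp),
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts (an assumption, not reported in the paper).
rf = detection_rates(tp=419, fn=1, fp=18, tn=402)
rounded = {name: round(value, 3) for name, value in rf.items()}
```

With these counts the rounded values are TPR 0.998, FNR 0.002, TNR 0.957, FPR 0.043, precision 0.959, and f-measure 0.978, with an accuracy of about 0.977, which matches the figures reported for random forest with 1000 attributes.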

CONCLUSION
The present work aimed to utilize machine learning techniques to detect ransomware attacks. The importance of this paper lies in using features extracted directly from the raw bytes of the executable file together with machine learning techniques. Three classification algorithms have been utilized in the current study: random forest, J48, and radial basis function network. It is found that random forest is the most accurate in detecting ransomware using the proposed method. The most suitable size in the feature selection process was found to be 1000 attributes. The results show that the random forest achieved the best results on all the measured parameters across the different numbers of attributes, as follows: f-measure 97.8, recall 99.8, ROC 99.6, and precision 95.9. At the same time, these results reveal that as the number of attributes increases, the values of recall, precision, f-measure, and ROC decrease. This finding indicates that the number of attributes has a significant effect on classifier accuracy, because irrelevant attributes or features in the data can decrease the accuracy. The advantage of the proposed method lies in the direct extraction of features from binary files, without the need for opcode features, which take more time in the preprocessing stage because of the disassembly process.