Authors: Santosh Kosgi, Mohammad Waseem, Arunabha Choudhury, and Satnam Singh
Deep learning-based methods have recently been applied successfully to a range of computer vision and NLP problems [1]. AI researchers have pushed state-of-the-art benchmarks in object detection, language translation, and sentiment analysis by statistically significant margins. However, the application of deep learning in Information Security (InfoSec) is still in its nascent stages. We introduced deep learning and its applications for InfoSec in our earlier article [2]; this blog continues that topic. Malware detection and network intrusion detection are two areas where deep learning has shown significant improvements over rule-based and classic machine learning-based solutions [3]. In this blog, we demonstrate the power of deep neural networks, built with TensorFlow and Keras, to detect obfuscated PowerShell scripts.
PowerShell is a task automation and configuration management framework built around a robust command-line shell. Microsoft made it open source and cross-platform in August 2016. PowerShell has been a heavily exploited tool in various cyber attack scenarios. According to a research study by Symantec, 95.4% of all PowerShell scripts analyzed by the Symantec Blue Coat Sandbox were malicious [4]. The Odinaff hacker group leveraged malicious PowerShell scripts as part of its attacks on banks and other financial institutions [5]. Many tools, such as PowerShell Empire [6] and PowerSploit [7], are freely available on the internet and can be used for reconnaissance, privilege escalation, lateral movement, persistence, defense evasion, and exfiltration.
Adversaries typically use two techniques to evade detection. First, by running fileless malware, they load malicious scripts downloaded from the internet directly into memory, thereby evading Antivirus (AV) file scanning. Second, they use obfuscation to make their code hard to decode, making it more difficult for an AV or an analyst to figure out the intent of the script. Obfuscation of PowerShell scripts for malicious intent is on the rise, and the task of analyzing them is made even more difficult by the high flexibility of PowerShell syntax. In Acalvio high-interaction decoys, we can monitor the PowerShell logs, commands, and scripts that an attacker tries to execute in the decoy. We collect these logs, analyze them in real time, and detect whether a script is obfuscated.
Problem:
For a Windows operating system, Microsoft PowerShell is an ideal candidate for an attacker’s tool. First, it is installed by default on Windows; second, attackers are better off using existing tools that let them blend in and possibly evade Antivirus (AV). Since PowerShell 3.0, Microsoft has enhanced PowerShell logging considerably. If Script Block Logging is enabled, the commands and scripts executed through PowerShell are captured in the event logs. These logs can be analyzed to detect and block malicious scripts. Obfuscation is typically used to evade such detection. Daniel Bohannon and Lee Holmes address the problem of detecting obfuscated scripts in their Black Hat paper [8]. They used a logistic regression classifier trained with gradient descent to achieve reasonable classification accuracy in separating obfuscated scripts from clean scripts. However, a deep feed-forward neural network (FNN) may improve other performance metrics such as precision and recall. Hence, in this blog, we use a deep neural network and compare its performance metrics with those of other machine learning (ML) classifiers.
Dataset:
We use the PowerShellCorpus dataset published and open sourced by Daniel Bohannon [9] for our data experiments. The dataset consists of roughly 300K PowerShell scripts scraped from various sources on the internet, such as GitHub, PowerShell Gallery, and TechNet. In addition, we scraped PowerShell scripts from PoshCode [10] and added them to the corpus. In total, we had nearly 3 GB of script data consisting of 300K clean scripts. We used the Invoke-Obfuscation [11] tool to obfuscate the scripts. With every script obfuscated by this tool, we have a labeled dataset in which each script carries the class label clean or obfuscated.
Data Experiments:
All the activities performed by an attacker in a high-interaction decoy are malicious. Detecting obfuscation, however, additionally indicates the presence of an advanced attacker. Here is a simple PowerShell command to get a list of processes:
Get-Process | Where {$_.Handles -gt 600} | sort Handles | Format-Table
This command may be obfuscated as:
((("{2}{9}{12}{0}{3}{10}{13}{4}{18}{8}{17}{11}{5}{16}{1}{15}{14}{7}{19}{6}"-f'-P','es','G','rocess8Dy Whe',' {','S','le','-Ta','-gt','e','r','y ','t','e','t',' 8Dy Forma','ort Handl',' 600} 8D','RYl_.Handles ','b')) -crePLACE'8Dy',[cHar]124-crePLACE'RYl',[cHar]36) | IEx
This looks suspicious and noisy. Here is another example of a subtle obfuscation for the same command:
&("{1}{2}{0}"-f 's','G','et-Proces')| &("{1}{0}"-f'here','W') {$_.Handles -gt 600} | &("{1}{0}" -f'ort','S') Handles | .("{1}{0}{2}"-f '-','Format','Table')
This kind of obfuscation makes it hard to detect the intent of a PowerShell command or script. Most malicious PowerShell scripts in the wild use such subtle variations to evade AVs easily. In practice, it is nearly impossible for a security analyst to review every PowerShell script to determine whether it is obfuscated. Obfuscation detection therefore needs to be automated. One can use a rule-based approach; however, it may miss many obfuscation types, and a domain expert has to write a large number of rules by hand. A machine learning or deep learning-based solution is therefore a better fit for this problem.
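To make the limitation of hand-written rules concrete, here is a minimal Python sketch of a rule-based detector. The two regular expressions and the names RULES and matched_rules are hypothetical and written only for this illustration; they are not rules used by Acalvio or by any AV product. The backtick-escaped sample is likewise our own example of an obfuscation style that the rules do not cover.

```python
import re

# Hypothetical hand-written rules; each one targets a single obfuscation style.
RULES = {
    # A quoted run of {n} placeholders followed by -f, e.g. "{1}{0}"-f 'ort','S'
    "format_operator": re.compile(r"""["'](?:\{\d+\}){2,}["']\s*-f""", re.IGNORECASE),
    # Two quoted fragments glued together with +, e.g. 'Ne'+'w-'
    "string_concat": re.compile(r"""["']\s*\+\s*["']"""),
}


def matched_rules(script):
    """Return the names of all rules that fire on the script."""
    return [name for name, pattern in RULES.items() if pattern.search(script)]


clean = "Get-Process | Where {$_.Handles -gt 600} | sort Handles | Format-Table"
reordered = r"""&("{1}{2}{0}"-f 's','G','et-Proces') | .("{1}{0}{2}"-f '-','Format','Table')"""
concatenated = "('Ne'+'w-'+'Objec'+'t')"
# Backtick escaping is an obfuscation style that neither rule covers.
backticked = "Get-Pr`oce`ss | Format-Tab`le"

for sample in (clean, reordered, concatenated, backticked):
    print(matched_rules(sample))
```

The first three samples behave as intended, but the backtick-escaped command matches no rule and would slip through until someone writes a third rule for it. The concatenation rule is also crude enough to fire on perfectly benign string concatenation, which hints at the false-positive side of the same problem.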
Typically, the first step of machine learning is data cleanup and preprocessing. For the obfuscation detection dataset, preprocessing consists of removing Unicode characters from each script.
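A minimal sketch of this preprocessing step in Python follows. It assumes that "Unicode characters" means everything outside the 7-bit ASCII range; the text does not spell this out, and the function name strip_non_ascii is ours.

```python
def strip_non_ascii(script: str) -> str:
    """Drop every character outside the 7-bit ASCII range (assumed meaning)."""
    return script.encode("ascii", errors="ignore").decode("ascii")


# Typographic quotes and the accented letter are removed, the rest is kept.
sample = "Write-Host \u201ch\u00e9llo\u201d"
print(strip_non_ascii(sample))
```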
Obfuscated scripts look different from normal scripts: some combinations of characters used in obfuscated scripts do not appear in normal scripts. We therefore use a character-level representation for all PowerShell scripts instead of a word-based one. Another reason is that sophisticated obfuscation can completely blur the boundaries between words, tokens, and identifiers in a PowerShell script, rendering word-based tokenization useless. Security researchers have also used character-based representations to detect obfuscated PowerShell scripts: Lee Holmes from Microsoft explored a character frequency-based representation with cosine similarity in his blog [12].
There are multiple ways to vectorize characters. One-hot encoding represents every character by a bit that is set to 1 if the character is present in the script and to 0 otherwise. Classifiers trained with single-character one-hot encoding perform well. However, this can be improved by capturing the sequence of characters. For example, a command like New-Object may be obfuscated as ('Ne'+'w-'+'Objec'+'t'). The plus (+) operator is common in any PowerShell script, but a plus (+) followed by a single (') or double (") quote may not be as common. Therefore, we use TF-IDF scores of character bigrams as the input features to the classifiers.
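The following Python sketch shows how character bigrams can be weighted with TF-IDF, using only the standard library. The weighting formula (term frequency normalized by script length, idf = ln(N / df)) is a textbook variant chosen for illustration; the exact variant and implementation used in our experiments are not described here, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(script):
    """Split a script into overlapping two-character windows."""
    return [script[i:i + 2] for i in range(len(script) - 1)]


def tfidf_bigrams(scripts):
    """Return one {bigram: score} dict per script.

    Assumed weighting: tf = count / bigrams in the script, idf = ln(N / df).
    """
    counts = [Counter(char_bigrams(s)) for s in scripts]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each script counts once per bigram
    n_docs = len(scripts)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            bigram: (count / total) * math.log(n_docs / doc_freq[bigram])
            for bigram, count in c.items()
        })
    return vectors


scripts = ["New-Object", "('Ne'+'w-'+'Objec'+'t')"]
vectors = tfidf_bigrams(scripts)
# The two highest-scoring bigrams of the obfuscated variant.
top = sorted(vectors[1], key=vectors[1].get, reverse=True)[:2]
print(top)
```

In this two-script toy corpus, the quote-plus bigrams score highest for the obfuscated variant, while bigrams shared with the clean command, such as Ne, score zero. A library implementation such as scikit-learn's TfidfVectorizer with analyzer="char" and ngram_range=(2, 2) computes a smoothed version of the same idea and is a more practical choice at the scale of 300K scripts.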
Here are the 20 bigrams with the highest TF-IDF scores in the training dataset:
Clean script
['er', 'te', 'in', 'at', 're', 'pa', 'st', 'on', 'me', 'en', 'ti', 'le', 'th', 'am', 'nt', 'es', 'se', 'or', 'ro', 'co']
Obfuscated script
["'+", "+'", '}{', ",'", "',", 'er', 'te', 'in', 're', 'me', 'st', 'et', 'se', 'ar', 'on', 'at', 'ti', 'am', 'es', '{1']
Each script is represented by its character bigrams. We feed these features to a deep feed-forward neural network (FNN) with N hidden layers, built using Keras and TensorFlow.
Figure 1: Obfuscation Detection data flow diagram using deep FNN
The data flow diagram in Figure 1 shows the training and prediction flow for obfuscation detection. We varied the number of hidden layers in the deep FNN and found N=6 to be optimal. ReLU is used as the activation function for all hidden layers. Each hidden layer is dense, has dimension 1000, and uses a dropout rate of 0.5. The last layer uses sigmoid as its activation function. Figure 2 shows the deep FNN representation for obfuscation detection.
Figure 2: FNN Representation for Obfuscation Detection
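The architecture described above could be written in Keras roughly as follows. This is a sketch under stated assumptions, not our production code: the function names, the input dimension, the Adam optimizer, and the binary cross-entropy loss are assumptions, since only the layer count, layer width, activations, and dropout rate are given in the text. TensorFlow is imported inside build_fnn so that the small parameter-count helper can be used without it.

```python
def build_fnn(input_dim, hidden_layers=6, units=1000, dropout=0.5):
    """Build a deep FNN: dense ReLU layers with dropout, sigmoid output."""
    from tensorflow import keras  # imported lazily; requires TensorFlow

    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(hidden_layers):
        model.add(keras.layers.Dense(units, activation="relu"))
        model.add(keras.layers.Dropout(dropout))
    # Single sigmoid unit: estimated probability that the script is obfuscated.
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    # Optimizer and loss are assumptions; the text does not specify them.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


def count_parameters(input_dim, hidden_layers=6, units=1000):
    """Count weights and biases of the network above (dropout adds none)."""
    total = 0
    fan_in = input_dim
    for _ in range(hidden_layers):
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total + fan_in + 1  # output layer: one weight per unit plus a bias


# Hypothetical input size of 100 bigram features, for illustration only.
print(count_parameters(100))
```

Most of the parameters sit in the five 1000-by-1000 hidden-to-hidden weight matrices, which is one reason a dropout rate as high as 0.5 is a sensible guard against overfitting.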
We see a validation accuracy of nearly 92%, which indicates that the model generalizes well beyond the training set. Next, we evaluate the model on the test set, where it reaches an accuracy of 93% with a recall of 0.99 for the obfuscated class. Figure 3 shows the classification accuracy and classification loss for the training and validation data at each epoch.
Figure 3: Classification Accuracy and Loss plots for Training and Validation Phase
Table 1 compares the results of the deep FNN with those of other ML models. Precision and recall are used to measure the efficacy of the models.
Table 1: Output of ML Models for Obfuscation Detection.
Classifier Used | Precision | Recall |
Random Forest | 0.92 | 0.97 |
Logistic Regression | 0.91 | 0.87 |
Deep Feed-forward Neural Network (FNN) | 0.89 | 0.99 |
Our objective is to flag as many obfuscated scripts as possible, i.e., to minimize the false negative rate for the obfuscated class. Recall is therefore the appropriate measure in this case. Table 1 shows that the deep FNN achieves the highest recall (0.99) of the three classifiers, at the cost of a slightly lower precision (0.89). The dataset used in our experiments is of medium scale; in the wild, datasets are typically much larger, and we expect the deep FNN to perform even better relative to the other ML classifiers there.
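For readers less familiar with the two metrics, the short Python sketch below computes precision and recall for the obfuscated class from true and predicted labels. The labels are made up for illustration and are not taken from our experiments; the function name is hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labels: 1 = obfuscated, 0 = clean.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]  # one clean script is flagged by mistake
print(precision_recall(y_true, y_pred))
```

In this toy example every obfuscated script is caught, so recall is 1.0, while the single false alarm lowers precision to 0.8. A detector tuned for high recall shows the same pattern: it misses few obfuscated scripts and accepts some false alarms in exchange.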
Conclusion:
PowerShell obfuscation is a smart way to bypass existing antivirus products and hide an attack’s intent, and many adversaries use it. In this blog, we demonstrated the power of deep learning combined with Acalvio’s deception technology [13] to detect obfuscated PowerShell scripts in a high-interaction decoy. Acalvio’s ShadowPlex [14], an autonomous deception platform, provides the ability to engage with adversaries, understand their intent and tools, and monitor all of their activities. In the next blog of this series, we will share more use cases where AI and deception can be leveraged for information security.
References:
[1] “Deep Learning,” Ian Goodfellow, Yoshua Bengio, Aaron Courville; p. 196, MIT Press, 2016.
[2] “Using Deep Learning for Information Security – Part 1,” Acalvio Blog, 2018.
[3] “Malware detection using machine learning,” Dragoş Gavriluţ, Mihai Cimpoeşu, Dan Anton, Liviu Ciortuz; International Multiconference on Computer Science and Information Technology, Mragowo, 2009.
[4] “The increased use of PowerShell in attacks,” Symantec, 2016.
[5] “Odinaff: New Trojan used in financial attacks,” Symantec Security Response, Oct 2016.
[6] “Empire,” Will Schroeder, Justin Warner, Matt Nelson, Steve Borosh, Alex Rymdeko-harvey, Chris Ross; 2017.
[7] “PowerSploit,” Matthew Graeber; 2012.
[8] “Revoke-Obfuscation,” Daniel Bohannon, Lee Holmes; 2017.
[9] “PowerShell Corpus.”
[10] “PoshCode – PowerShell Projects for Power Users.”
[11] “Invoke-Obfuscation,” Daniel Bohannon; 2017.
[12] “More Detecting Obfuscated PowerShell,” Lee Holmes; 2016.
[13] “The Definitive Guide to Deception,” Acalvio Technologies.
[14] “ShadowPlex,” Acalvio Technologies.