Living off the Analyst: Harvesting Features from Yara Rules for Malware Detection

Siddhant Gupta, Fred Lu, Andrew Barlow, Edward Raff, Francis Ferraro, Cynthia Matuszek, Charles Nicholas, James Holt

TL;DR

Sub-signatures extracted from publicly available YARA rules are assembled into a feature set that discriminates malicious samples from benign ones more effectively, and these features are shown to add value beyond traditional features on the EMBER 2018 dataset.

Abstract

A strategy used by malicious actors is to "live off the land," where benign systems and tools already available on a victim's systems are repurposed for the malicious actor's intent. In this work, we ask whether anti-virus developers can similarly repurpose existing work to improve their malware detection capability. We show that this is plausible via YARA rules, which use human-written signatures to detect specific malware families, functionalities, or other markers of interest. By extracting sub-signatures from publicly available YARA rules, we assembled a set of features that can more effectively discriminate malicious samples from benign ones. Our experiments demonstrate that these features add value beyond traditional features on the EMBER 2018 dataset. Manual analysis of the added sub-signatures shows a power-law behavior in a combination of features that are specific and unique, as well as features that occur often. A prior expectation may be that the features would be limited by being overly specific to individual malware families. This behavior is observed, yet the features are apparently still useful in practice. In addition, we find sub-signatures that are dual-purpose (e.g., detecting virtual machine environments) or broadly generic (e.g., DLL imports).

Paper Structure

This paper contains 20 sections, 3 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: Example of a Yara signature from https://github.com/tjnel/yara_repo/blob/master/ransomware/crime_win_ransom_anubi.yara. The highlighted "strings" of the Yara rule will be extracted and treated as 20 different features. The conditional logic that combines these components is ignored in our approach because it makes the whole rule as a singular unit too rare to be useful to a machine-learning model. By focusing on the individual components that we term sub-signatures, we obtain sufficient frequency in occurrence that they aid a predictive model.
  • Figure 2: Distributional statistics of Yara sub-signatures over the Ember dataset in terms of predicting whether a file is malware. This shows that individual sub-signatures have little predictive power and tend to be highly specific in what they detect. Thus, we need to consider a joint feature selection to obtain meaningful results.
  • Figure 3: Using only Yara sub-signatures as features, we see that only a limited number of the original 19k sub-signatures are needed for predictive accuracy. This also shows that training a tree ensemble after feature selection provides higher accuracy, suggesting a two-step model-building process. This informs our strategy for the full approach of leveraging Yara sub-signatures.
  • Figure 4: In this experiment, linear classifiers are trained on a subset of Yara sub-signatures to predict Ember test labels. We compare metric-based selection of Yara sub-signatures (filtering by top accuracy, precision, or recall) versus automated feature selection via the Lasso penalty. Lasso selection gives higher accuracy for any feature set size.
  • Figure 5: Performance of our full model as presented in Algorithm 1, showing how (a) test accuracy and (b) AUC at low FPR increase as Yara features are added. A Yara-only model without Ember features (blue) shows the importance of using side information. Furthermore, Yara features demonstrate clear value when added to the baseline Ember-only model (pink).
  • ...and 4 more figures
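
The sub-signature idea described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy rule, the function names `extract_sub_signatures` and `featurize`, and the parsing shortcuts are all assumptions made for illustration: only plain text strings and wildcard-free hex strings are handled, modifiers such as `nocase` and `wide` are ignored, escape sequences are not decoded, and the rule's condition is discarded, as the caption describes.

```python
import re

# Hypothetical toy rule, written for this example only (not from the paper
# or from any real rule repository).
RULE = r'''
rule toy_ransom_example
{
    strings:
        $s1 = "vssadmin delete shadows" nocase
        $s2 = "YOUR FILES ARE ENCRYPTED" wide ascii
        $h1 = { 4D 5A 90 00 }
    condition:
        uint16(0) == 0x5A4D and 2 of them
}
'''

# One string definition per line: $name = value [modifiers]
STRING_DEF = re.compile(r'\s*\$(\w*)\s*=\s*(.+?)\s*$')


def extract_sub_signatures(rule_text):
    """Return (name, byte pattern) pairs from the strings section of a rule.

    The condition block is deliberately ignored, so every string becomes
    its own candidate feature.
    """
    section = re.search(r'strings\s*:(.*?)condition\s*:', rule_text, re.DOTALL)
    if section is None:
        return []
    subs = []
    for line in section.group(1).splitlines():
        definition = STRING_DEF.match(line)
        if definition is None:
            continue
        name, value = definition.groups()
        if value.startswith('"'):
            # Text string: keep the quoted content, drop any modifiers.
            text = re.match(r'"((?:[^"\\]|\\.)*)"', value)
            if text:
                subs.append((name, text.group(1).encode('latin-1')))
        elif value.startswith('{'):
            # Hex string without wildcards or jumps: convert to raw bytes.
            hex_body = re.match(r'\{([0-9A-Fa-f\s]+)\}', value)
            if hex_body:
                digits = ''.join(hex_body.group(1).split())
                subs.append((name, bytes.fromhex(digits)))
    return subs


def featurize(data, sub_signatures):
    """Binary feature vector: 1 if the byte pattern occurs in the file."""
    return [int(pattern in data) for _, pattern in sub_signatures]


subs = extract_sub_signatures(RULE)
sample = b'MZ\x90\x00 ... cmd /c vssadmin delete shadows /all'
print([name for name, _ in subs])  # ['s1', 's2', 'h1']
print(featurize(sample, subs))     # [1, 0, 1]
```

Under these assumptions, each file is reduced to one binary indicator per sub-signature, which is the kind of representation a downstream feature-selection step and classifier could consume. A production pipeline would more likely rely on the YARA engine itself to evaluate the strings, so that modifiers, regular expressions, and wildcards are handled correctly.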