Malware Detection based on API calls

Christofer Fellicious, Manuel Bischof, Kevin Mayer, Dorian Eikenberg, Stefan Hausotte, Hans P. Reiser, Michael Granitzer

TL;DR

This work addresses the challenge of detecting malware using API-call activity by proposing an order-invariant, frequency-based representation that omits temporal sequencing. It introduces the largest publicly available dataset of API-call traces, focusing on the ntdll.dll library, with over 300k malware and 10k benign samples totaling more than 550 GB, and provides open-source code and documentation. By modeling API calls with unigram, bigram, and trigram features (and a combined feature set) and training a Random Forest classifier, the study shows that high detection accuracy (F1 about 0.85–0.89; ROC AUC around 0.98) can be achieved with a relatively small number of calls, with 2500 calls identified as a practical operating point. The findings demonstrate that simple, well-engineered frequency features can yield efficient, scalable real-time malware detection, and they establish a valuable resource for the cybersecurity research community.

Abstract

Malware attacks pose a significant threat in today's interconnected digital landscape, causing billions of dollars in damages. Detecting malware and identifying its family as early as possible provides an edge in protecting against it. We explore a lightweight, order-invariant approach to detecting and mitigating malware threats: analyzing API calls without regard to their sequence. We publish a public dataset of over three hundred thousand samples and their function call parameters for this task, annotated with labels indicating benign or malicious activity. The complete dataset is over 550 GB uncompressed. We leverage machine learning algorithms, such as random forests, and conduct behavioral analysis by examining patterns and anomalies in API call sequences. By investigating how the function calls occur regardless of their order, we can identify discriminating features that help us detect malware early on. The models we have developed are both effective and efficient: they are lightweight and can run on any machine with minimal performance overhead, while still achieving an F1-score of over 85%. We also empirically show that we only need a subset of the function call sequence, specifically calls to the ntdll.dll library, to identify malware. Our empirical evaluations demonstrate the accuracy and scalability of this approach. The code is open source and available on GitHub, and the dataset is available on Zenodo.
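
The frequency-based representation described above can be illustrated with a minimal sketch. It is not the authors' released code: the function names (`ngram_counts`, `combined_features`, `to_vector`), the example trace, and the use of plain counts rather than normalized frequencies are all assumptions made for illustration. The 2500-call cutoff mirrors the operating point reported in the TL;DR.

```python
from collections import Counter

def ngram_counts(calls, n, max_calls=2500):
    """Count n-grams over the first `max_calls` API calls of one trace.

    `calls` is a list of API-call names. With n=1 the result ignores call
    order entirely; n=2 and n=3 only capture which calls are adjacent.
    """
    head = calls[:max_calls]
    return Counter(tuple(head[i:i + n]) for i in range(len(head) - n + 1))

def combined_features(calls, max_calls=2500):
    """Merge unigram, bigram, and trigram counts into one feature map."""
    feats = Counter()
    for n in (1, 2, 3):
        feats.update(ngram_counts(calls, n, max_calls))
    return feats

def to_vector(counts, vocabulary):
    """Project a count map onto a fixed, ordered vocabulary of n-grams."""
    return [counts.get(term, 0) for term in vocabulary]

# Hypothetical trace of ntdll.dll calls, for illustration only.
trace = ["NtOpenFile", "NtReadFile", "NtClose", "NtOpenFile", "NtClose"]

unigrams = ngram_counts(trace, 1)
bigrams = ngram_counts(trace, 2)
combined = combined_features(trace)

# A fixed vocabulary would normally be built from the training set.
vocabulary = sorted(combined)
vector = to_vector(combined, vocabulary)
```

Vectors built this way, one per sample over a shared vocabulary, could then be passed to a classifier such as scikit-learn's `RandomForestClassifier`. The hyperparameters and exact feature normalization used in the paper are not stated in this summary.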

Paper Structure

This paper contains 7 sections, 4 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Sample log of a single API call.
  • Figure 2: F1-score for all our models at different maximum API call counts. The X-axis is on a logarithmic scale. The results are the average of four different runs.
  • Figure 3: Precision-recall curve for the unigram, bigram, trigram, and combined models. The trigram model's confidence drops much more quickly than that of the other models.