Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns

Constantinos Patsakis, Fran Casino, Nikolaos Lykousas

TL;DR

This paper investigates the use of state-of-the-art large language models (LLMs) to deobfuscate real-world malware payloads, focusing on Emotet-derived PowerShell scripts. It builds a practical pipeline and evaluates four LLMs (GPT-4, Gemini Pro, Code Llama Instruct, Mixtral) on 2,000 obfuscated scripts, measuring the ability to extract dropper URLs as a proxy for successful deobfuscation. Results show GPT-4 achieving the highest URL-identification accuracy (about 69.6%), with diminishing performance for other models, and reveal issues such as hallucinations and occasional refusals that challenge automation. The work demonstrates the potential for fine-tuned LLMs to augment threat-intelligence pipelines, enabling automated extraction of critical indicators while highlighting the need for robust prompting, input-size management, and ethical considerations in deployment.
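As an illustration of how URL extraction can act as a proxy for successful deobfuscation, the following is a minimal sketch that scores a single LLM response against the known dropper URLs of a script. It is not the authors' evaluation code: the regular expression, the function names (`extract_urls`, `score_response`, `url_accuracy`), the exact-match comparison, and the `example.test` URLs are all assumptions made for illustration.

```python
import re

# Assumed pattern: an http/https URL running up to whitespace or a common delimiter.
URL_RE = re.compile(r"https?://[^\s'\"<>()\[\]{},;]+", re.IGNORECASE)


def extract_urls(text):
    """Return the set of URLs found in text, with a trailing full stop removed."""
    return {m.group(0).rstrip(".") for m in URL_RE.finditer(text)}


def score_response(llm_output, ground_truth):
    """Compare the URLs in one LLM response with the known dropper URLs."""
    found = extract_urls(llm_output)
    truth = set(ground_truth)
    return {
        "correct": len(found & truth),
        "missed": len(truth - found),
        # URLs the model reported that are not in the script: possible hallucinations.
        "spurious": len(found - truth),
    }


def url_accuracy(results):
    """Fraction of ground-truth URLs recovered across all scored responses."""
    total = sum(r["correct"] + r["missed"] for r in results)
    return sum(r["correct"] for r in results) / total if total else 0.0


# Hypothetical response and ground truth for one script.
response = (
    "The script downloads from http://example.test/a/ and "
    "https://example.test/b. It also contacts http://other.test/x"
)
truth = ["http://example.test/a/", "https://example.test/b", "http://example.test/c"]
result = score_response(response, truth)
```

Counting spurious URLs separately from missed ones keeps hallucinated indicators visible instead of folding them into a single accuracy figure. An empty response, such as a refusal, simply scores every ground-truth URL as missed.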

Abstract

The integration of large language models (LLMs) into various pipelines is increasingly widespread, effectively automating many manual tasks and often surpassing human capabilities. Cybersecurity researchers and practitioners have recognised this potential and are actively exploring its applications, given the vast volume of heterogeneous data that must be processed to identify anomalies, potential bypasses, attacks, and fraudulent incidents. In addition, LLMs' advanced capabilities in generating functional code, comprehending code context, and summarising its operations can be leveraged for reverse engineering and malware deobfuscation. To this end, we examine the deobfuscation capabilities of state-of-the-art LLMs. Rather than discussing a hypothetical scenario, we evaluate four LLMs on real-world malicious scripts used in the notorious Emotet malware campaign. Our results indicate that, while not yet fully accurate, some LLMs can efficiently deobfuscate such payloads. Fine-tuning LLMs for this task is therefore a promising direction for future AI-powered threat intelligence pipelines in the fight against obfuscated malware.

Paper Structure (9 sections, 9 figures)

This paper contains 9 sections and 9 figures.

Figures (9)

  • Figure 1: The modus operandi of Emotet.
  • Figure 2: A sample of the Emotet PowerShell payload. MD5 hash of the sample: 26d63ca2075c04107c0ab42184b86d3c.
  • Figure 3: The environment used for the analysis of the malicious documents.
  • Figure 4: Structure of the task prompts used in OpenAI's GPT (top) and Code Llama (bottom).
  • Figure 5: Comparison of the results for each LLM (GPT-4, Gemini Pro, Code Llama, Mixtral).
  • ...and 4 more figures