Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns

Constantinos Patsakis; Fran Casino; Nikolaos Lykousas

TL;DR

This paper investigates whether state-of-the-art large language models (LLMs) can deobfuscate real-world malware payloads, focusing on Emotet-derived PowerShell scripts. It builds a practical pipeline and evaluates four LLMs (GPT-4, Gemini Pro, Code Llama Instruct, Mixtral) on 2,000 obfuscated scripts, using the ability to extract dropper URLs as a proxy for successful deobfuscation. GPT-4 achieves the highest URL-identification accuracy (about 69.6%), and the other models perform worse. The evaluation also reveals hallucinations and occasional refusals, both of which complicate automation. The work shows that fine-tuned LLMs could augment threat-intelligence pipelines by automatically extracting critical indicators, while highlighting the need for robust prompting, input-size management, and ethical considerations in deployment.
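To make the URL-extraction proxy concrete, the following is a minimal sketch of how a model's free-form output might be scored against the known dropper URLs of a script. It is not the authors' evaluation code: the regular expression, the function names `extract_urls` and `score_output`, and the placeholder URLs are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper): match
# http(s) URLs up to whitespace or a common delimiter.
URL_RE = re.compile(r"https?://[^\s'\"<>`,;()\[\]{}]+", re.IGNORECASE)


def extract_urls(text):
    """Return the set of URLs found in free-form model output."""
    # Strip a trailing full stop so "…/b." at the end of a sentence
    # still compares equal to "…/b".
    return {match.rstrip(".") for match in URL_RE.findall(text)}


def score_output(model_output, expected_urls):
    """Compare URLs reported by a model with the ground-truth URLs."""
    expected = {url.rstrip(".") for url in expected_urls}
    found = extract_urls(model_output)
    correct = found & expected
    return {
        # Fraction of ground-truth URLs that the model recovered.
        "recall": len(correct) / len(expected) if expected else 0.0,
        # Reported URLs that are not in the ground truth; these are
        # candidates for hallucinated output and need manual review.
        "unexpected": sorted(found - expected),
    }


if __name__ == "__main__":
    # Placeholder data using reserved example domains.
    output = (
        "The script downloads from http://example.com/a/ and "
        "https://example.org/b. It also tries http://made-up.test/x"
    )
    truth = ["http://example.com/a/", "https://example.org/b"]
    print(score_output(output, truth))
```

Separating recovered URLs from unexpected ones keeps the two failure modes mentioned above distinct: a low recall suggests incomplete deobfuscation, while a non-empty `unexpected` list may point to hallucinated indicators.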

Abstract

The integration of large language models (LLMs) into various pipelines is increasingly widespread, effectively automating many manual tasks and often surpassing human capabilities. Cybersecurity researchers and practitioners have recognised this potential and are actively exploring applications of LLMs, given the vast volume of heterogeneous data that must be processed to identify anomalies, potential bypasses, attacks, and fraudulent incidents. In addition, LLMs' advanced capabilities in generating functional code, comprehending code context, and summarising its operations can be leveraged for reverse engineering and malware deobfuscation. To this end, we delve into the deobfuscation capabilities of state-of-the-art LLMs. Rather than merely discussing a hypothetical scenario, we evaluate four LLMs on real-world malicious scripts used in the notorious Emotet malware campaign. Our results indicate that, while not yet fully accurate, some LLMs can efficiently deobfuscate such payloads. Fine-tuning LLMs for this task is therefore a viable direction for future AI-powered threat intelligence pipelines in the fight against obfuscated malware.
