Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models into Assembly Code Obfuscation
Seyedreza Mohseni, Seyedali Mohammadi, Deepa Tilwani, Yash Saxena, Gerald Ketu Ndawula, Sriram Vema, Edward Raff, Manas Gaur
TL;DR
The paper examines whether LLMs can be used to obfuscate assembly code. It introduces the MetamorphASM benchmark and the accompanying MetamorphASM Dataset (MAD), which contains 328,200 obfuscated assembly samples built from three obfuscation techniques. It benchmarks a broad set of LLMs under zero-shot and few-shot prompting, using Delta Entropy $\Delta H_{AB}$ and Cosine Similarity (CS), complemented by human evaluation to validate obfuscation quality. The findings show that GPT-family models, notably GPT-4o-mini, can produce obfuscated assembly with entropy and structural changes comparable to expert-crafted transformations, which has significant security implications for anti-virus (AV) engines and malware defenses. The work provides a concrete resource for studying remediations and testing local LLMs, and a way to assess and mitigate LLM-enabled metamorphic obfuscation in real-world settings.
Abstract
Malware authors often employ code obfuscations to make their malware harder to detect. Existing tools for generating obfuscated code often require access to the original source code (e.g., C++ or Java), and adding new obfuscations is a non-trivial, labor-intensive process. In this study, we ask the following question: Can Large Language Models (LLMs) generate new obfuscated assembly code? If so, this poses a risk to anti-virus engines and potentially gives attackers more flexibility to create new obfuscation patterns. We answer this in the affirmative by developing the MetamorphASM benchmark, which comprises the MetamorphASM Dataset (MAD) along with three code obfuscation techniques: dead code, register substitution, and control flow change. The MetamorphASM benchmark systematically evaluates the ability of LLMs to generate and analyze obfuscated code using MAD, which contains 328,200 obfuscated assembly code samples. We release this dataset and analyze the success rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder, CodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly code. The evaluation uses established information-theoretic metrics and manual human review to ensure correctness, and it provides a foundation for researchers to study and develop remediations to this risk.
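To make the two automatic metrics concrete, the following is a minimal sketch of how an entropy difference and a cosine similarity between an original and an obfuscated assembly snippet could be computed. This section does not give the paper's exact definitions, so everything here is an assumption for illustration: the whitespace tokenizer, the token-frequency Shannon entropy, the use of an absolute difference for the entropy change, the bag-of-tokens cosine similarity, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(asm: str) -> list[str]:
    # Assumption: a naive tokenizer that splits on whitespace and commas.
    return asm.replace(",", " ").split()


def shannon_entropy(tokens: list[str]) -> float:
    # Shannon entropy (in bits) of the empirical token distribution.
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cosine_similarity(a_tokens: list[str], b_tokens: list[str]) -> float:
    # Cosine similarity between bag-of-tokens count vectors.
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical example: dead code insertion adds `nop` instructions.
original = "mov eax, 1\nadd eax, ebx\nret"
obfuscated = "mov eax, 1\nnop\nadd eax, ebx\nnop\nret"

tok_a, tok_b = tokenize(original), tokenize(obfuscated)

# Assumption: the entropy change is taken as an absolute difference.
delta_h = abs(shannon_entropy(tok_b) - shannon_entropy(tok_a))
cs = cosine_similarity(tok_a, tok_b)

print(f"delta entropy: {delta_h:.3f} bits, cosine similarity: {cs:.3f}")
```

Under these assumptions, a larger entropy difference together with a lower cosine similarity indicates that the obfuscated snippet differs more from the original. Neither value says anything about whether the two snippets still behave the same, which is the kind of correctness check the abstract assigns to manual human review.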
