Disassembling Obfuscated Executables with LLM
Huanyao Rong, Yue Duan, Hang Zhang, XiaoFeng Wang, Hongbo Chen, Shengchen Duan, Shen Wang
TL;DR
DisasLLM is an end-to-end disassembly framework that uses a fine-tuned LLM-based validity classifier to identify correctly decoded instructions within obfuscated binaries. The system combines a traditional linear/recursive disassembler with LLM-driven checks, using prefilters and BFS-based context to repair mis-disassembled regions, and it substantially outperforms state-of-the-art baselines on heavily obfuscated benchmarks. Key contributions include a token-classification approach to instruction validity, an end-to-end disassembly strategy, and an extensive evaluation demonstrating robustness to junk bytes and obfuscation. The work highlights the practical potential of integrating LLMs with classical disassembly techniques to improve resilience against obfuscation in binary analysis.
Abstract
Disassembly is a challenging task, particularly for obfuscated executables containing junk bytes, which are designed to induce disassembly errors. Existing solutions rely on heuristics or machine learning techniques, but have achieved only limited success. Fundamentally, such obfuscation cannot be defeated without an in-depth understanding of the binary executable's semantics, which the emergence of large language models (LLMs) has made possible. In this paper, we present DisasLLM, a novel LLM-driven disassembler, to overcome the challenge of analyzing obfuscated executables. DisasLLM consists of two components: an LLM-based classifier that determines whether an instruction in an assembly code snippet is correctly decoded, and a disassembly strategy that leverages this model to disassemble obfuscated executables end-to-end. We evaluated DisasLLM on a set of heavily obfuscated executables, where it significantly outperforms other state-of-the-art disassembly solutions.
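To make the junk-byte problem concrete, the following minimal sketch shows how a single junk byte desynchronizes a linear sweep, and how a per-instruction validity check can resynchronize it. Everything here is an illustrative assumption rather than DisasLLM's implementation: the four-opcode toy instruction set, the function names, and in particular `fake_classifier`, which hard-codes the verdict that the fine-tuned LLM classifier would be expected to produce.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy ISA (not x86): opcode -> (mnemonic, instruction length).
OPCODES = {0x01: ("nop", 1), 0x02: ("push", 2), 0x03: ("jmp", 2), 0x04: ("ret", 1)}


def decode(code: bytes, off: int) -> Optional[Tuple[str, int]]:
    """Decode one instruction at `off`, or return None if it does not fit."""
    if off >= len(code):
        return None
    entry = OPCODES.get(code[off])
    if entry is None or off + entry[1] > len(code):
        return None
    return entry


def linear_sweep(code: bytes) -> List[int]:
    """Decode instructions back to back, trusting every decoded length."""
    offsets, off = [], 0
    while off < len(code):
        ins = decode(code, off)
        if ins is None:
            off += 1
            continue
        offsets.append(off)
        off += ins[1]
    return offsets


def sweep_with_validity_check(
    code: bytes, is_valid: Callable[[bytes, int], bool]
) -> List[int]:
    """Linear sweep that skips one byte whenever the classifier rejects a decode."""
    offsets, off = [], 0
    while off < len(code):
        ins = decode(code, off)
        if ins is None or not is_valid(code, off):
            off += 1  # treat this byte as junk and try to resynchronize
            continue
        offsets.append(off)
        off += ins[1]
    return offsets


# jmp +1 ; <junk 0x02> ; nop ; ret
# The junk byte looks like a 2-byte "push" and swallows the real "nop".
CODE = bytes([0x03, 0x01, 0x02, 0x01, 0x04])


def fake_classifier(code: bytes, off: int) -> bool:
    """Stand-in for the LLM validity classifier: rejects the known junk offset."""
    return off != 2


print(linear_sweep(CODE))                                # [0, 2, 4]
print(sweep_with_validity_check(CODE, fake_classifier))  # [0, 3, 4]
```

In this toy case the plain sweep reports a bogus instruction at offset 2 and misses the real one at offset 3, while the checked sweep recovers the intended instruction boundaries. The quality of the result depends entirely on the classifier, which is the component the paper trains.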
