UniASM: Binary Code Similarity Detection without Fine-tuning
Yeming Gu, Hui Shu, Fei Kang, Fan Hu
TL;DR
UniASM is a UniLM-based model for binary code similarity detection (BCSD) that requires no task-specific fine-tuning. It combines a rich-semantic function representation pipeline (instruction normalization, assembly tokenization, and linear function serialization) with two pretraining tasks, ALG and SFP, so that the pretrained model produces function embeddings that can be compared directly. Empirical results show that UniASM achieves state-of-the-art performance in cross-compiler, cross-optimization-level, and cross-obfuscation settings, and that it outperforms the baselines in known vulnerability search. The work also reports extensive ablations on backbone models, training tasks, normalization, tokenization, and sequence length, offering practical guidance for robust BCSD in real-world scenarios.
Abstract
Binary code similarity detection (BCSD) is widely used in binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models perform better than traditional feature-based approaches. However, previous studies have not examined in depth the key factors that affect model performance. In this paper, we design extensive ablation studies to explore these factors, and the results yield new insights into both code representation and model selection. We make two contributions. First, we propose a novel rich-semantic function representation technique so that the model captures the fine-grained semantics of binary code. Second, we introduce the first UniLM-based binary code embedding model, named UniASM, which includes two newly designed training tasks for learning representations of binary functions. The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approaches on the evaluation datasets. Compared to the best of the baseline methods, the average Recall@1 scores on cross-compilers, cross-optimization-levels, and cross-obfuscations improve by 12.7%, 8.5%, and 22.3%, respectively. In addition, in the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
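To make the Recall@1 metric concrete, the following is a minimal sketch of how such a score could be computed from function embeddings. It is an illustration under stated assumptions, not the evaluation code used for UniASM: the helper names (`cosine`, `recall_at_k`), the choice of cosine similarity as the ranking score, and the toy 2-dimensional vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(queries, pool, truth, k=1):
    """Fraction of queries whose true match is among the top-k ranked pool entries.

    queries: list of query embeddings
    pool:    list of candidate embeddings
    truth:   truth[i] is the pool index of the function matching queries[i]
    """
    hits = 0
    for i, query in enumerate(queries):
        # Rank candidate indices by similarity to the query, best first.
        ranked = sorted(range(len(pool)),
                        key=lambda j: cosine(query, pool[j]),
                        reverse=True)
        if truth[i] in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy data (hypothetical): three query functions and three candidates.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pool = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]
truth = [0, 1, 2]

print(recall_at_k(queries, pool, truth, k=1))
print(recall_at_k(queries, pool, truth, k=3))
```

In this toy setup the third query is orthogonal to its true match, so it is retrieved only when the whole pool is considered. In a BCSD evaluation, each query and its true match would typically be the same source function compiled under different compilers, optimization levels, or obfuscations.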
