UniASM: Binary Code Similarity Detection without Fine-tuning

Yeming Gu, Hui Shu, Fei Kang, Fan Hu

TL;DR

UniASM is a UniLM-based model for binary code similarity detection (BCSD) that requires no task-specific fine-tuning. It combines a rich-semantic function representation pipeline (instruction normalization, assembly tokenization, and linear function serialization) with two pretraining tasks, Assembly Language Generation (ALG) and Similar Function Prediction (SFP), so that the pretrained model produces function embeddings that can be compared directly. Empirically, UniASM achieves state-of-the-art results in cross-compiler, cross-optimization-level, and cross-obfuscation settings, and it outperforms the baselines in known-vulnerability search. The paper also reports extensive ablations on backbone models, training tasks, normalization, tokenization, and sequence length, offering practical guidance for building robust BCSD models.

Abstract

Binary code similarity detection (BCSD) is widely used in binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models perform better than traditional feature-based approaches. However, previous studies have not examined in depth the key factors that affect model performance. In this paper, we design extensive ablation studies to explore these factors, and the experimental results provide many new insights. We innovate in both code representation and model selection: we propose a novel rich-semantic function representation technique so that the model captures the nuances of binary code, and we introduce the first UniLM-based binary code embedding model, named UniASM, which includes two newly designed training tasks for learning representations of binary functions. The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approaches on the evaluation datasets. The average Recall@1 scores on cross-compilers, cross-optimization-levels, and cross-obfuscations improve by 12.7%, 8.5%, and 22.3%, respectively, compared to the best of the baseline methods. In addition, in the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
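
The abstract reports its headline results as Recall@1. As a rough illustration of how that metric is usually computed for embedding-based retrieval, the sketch below ranks a candidate pool by cosine similarity and counts how often the true match is ranked first. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the function names, the toy vectors, the use of cosine similarity, and the convention that `queries[i]` matches `pool[i]` are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(queries, pool):
    """Fraction of queries whose top-ranked pool entry is the true match.

    Assumption for this sketch: queries[i] and pool[i] are embeddings of the
    same source function compiled in two different ways.
    """
    hits = 0
    for i, query in enumerate(queries):
        # Index of the pool entry most similar to this query.
        best = max(range(len(pool)), key=lambda j: cosine(query, pool[j]))
        if best == i:
            hits += 1
    return hits / len(queries)


# Toy embeddings (made up for illustration only).
pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.0, 0.3]]

score = recall_at_1(queries, pool)
print(f"Recall@1 = {score:.3f}")
```

In this toy setup the third query is closer to the first pool entry than to its own match, so only two of the three queries are retrieved correctly at rank 1.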

Paper Structure (37 sections, 13 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Overview of UniASM.
  • Figure 2: Backbone network.
  • Figure 3: Assembly Language Generation.
  • Figure 4: Similar Function Prediction.
  • Figure 5: Performance of different models.
  • ...and 7 more figures