Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations

Xiuwei Shang, Li Hu, Shaoyin Cheng, Guoqiang Chen, Benlong Wu, Weiming Zhang, Nenghai Yu

TL;DR

IRBinDiff mitigates compilation differences by working on LLVM-IR, which offers a higher-level semantic abstraction than assembly, and combines a pre-trained language model with a graph neural network to capture both semantic and structural information, improving retrieval over large-scale candidate function sets.

Abstract

Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification. As IoT devices proliferate and rapidly evolve, their highly heterogeneous hardware architectures and complex compilation settings, coupled with the demand for large-scale function retrieval in practical applications, place greater demands on BCSD methods. In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction, and integrates a pre-trained language model with a graph neural network to capture both semantic and structural information from different perspectives. By introducing momentum contrastive learning, it improves retrieval over large-scale candidate function sets and distinguishes subtle similarities and differences between functions. Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.
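
The abstract names momentum contrastive learning without detailing it. As a rough illustration of the general technique only (not IRBinDiff's actual implementation), the sketch below shows its three usual ingredients: a key encoder updated as a moving average of the query encoder, an InfoNCE loss that scores one positive pair against a queue of negatives, and a fixed-size queue of past keys. All function names are hypothetical, and the momentum `m=0.999` and `temperature=0.07` are illustrative values, not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def momentum_update(query_params, key_params, m=0.999):
    """Key encoder follows the query encoder slowly: k <- m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def info_nce_loss(query, positive_key, negative_queue, temperature=0.07):
    """Cross-entropy over [positive, negatives...] with the positive at index 0."""
    q = normalize(query)
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negative_queue]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]

def update_queue(queue, new_keys, max_size):
    """Append the newest keys and keep only the most recent max_size entries."""
    return (queue + [normalize(k) for k in new_keys])[-max_size:]

# Toy usage: embeddings of the same function under two compilation settings
# (assumed positive pair) versus embeddings of unrelated functions.
query = [1.0, 0.1]
positive = [0.9, 0.2]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss = info_nce_loss(query, positive, negatives)
```

The queue is what lets the loss see many more negatives than fit in a single batch, which is the property that matters for one-to-many search over large candidate sets.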

Paper Structure

This paper contains 25 sections, 12 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Code example. (a) is the source code for bubble sort; (b), (c), and (d) are the assembly code with control flow of the binaries compiled with x64_gcc-11.4.0_O1, x86_gcc-11.4.0_O3, and x86_clang-14.0.0_O3, respectively.
  • Figure 2: Overview framework of IRBinDiff.
  • Figure 3: Input representation of language model.
  • Figure 4: Masked Language Model (MLM).
  • Figure 5: Next Sentence Prediction (NSP).
  • ...and 5 more figures