ReGraph: A Tool for Binary Similarity Identification

Li Zhou, Marc Dacier, Charalambos Konstantinou

TL;DR

ReGraph tackles binary code similarity across architectures and optimization levels by combining a lifter (RetDec), a re-optimizer (LLVM -O3), Code Property Graphs, and a Graph Neural Network to produce function-level embeddings. Similarity is computed as the Pearson correlation between function vectors, which enables fast top-K matching and reduces analyst workload. The approach yields substantial speedups (over $700\times$ faster than NLP-based methods) with accuracy comparable to state-of-the-art BCSD tools, and it comes with a lightweight, open-source implementation. This work highlights lifting and re-optimization as a general meta-method for improving BCSD pipelines and supports practical large-scale binary analysis tasks.
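The similarity step described above can be illustrated with a minimal sketch. It assumes each function has already been reduced to a fixed-length embedding vector; the names `pearson` and `top_k`, the function labels, and the toy vectors are hypothetical and are not taken from the ReGraph implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    # Center both vectors around their means.
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    norm = math.sqrt(sum(x * x for x in du) * sum(y * y for y in dv))
    if norm == 0.0:
        # A constant vector has no defined correlation; this sketch
        # treats that case as "no similarity" (an assumption).
        return 0.0
    return sum(x * y for x, y in zip(du, dv)) / norm


def top_k(
    query: Sequence[float],
    pool: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k pool entries most correlated with the query embedding."""
    scored = [(name, pearson(query, emb)) for name, emb in pool.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings standing in for GNN output (hypothetical values).
    pool = {
        "func_a": [1.0, 2.0, 3.0, 4.0],
        "func_b": [4.0, 3.0, 2.0, 1.0],
        "func_c": [1.0, 2.0, 3.0, 5.0],
    }
    query = [2.0, 4.0, 6.0, 8.0]
    for name, score in top_k(query, pool, k=2):
        print(f"{name}: {score:.4f}")
```

Because Pearson correlation is invariant to shifting and positive scaling of a vector, two embeddings that differ only by such a transformation score 1.0. An analyst would then inspect only the top-K candidates rather than the whole pool.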

Abstract

Binary Code Similarity Detection (BCSD) is essential not only for security tasks such as vulnerability identification but also for code copying detection, yet it remains challenging due to binary stripping and diverse compilation environments. Existing methods tend to adopt increasingly complex neural networks to improve accuracy, and computation time grows with that complexity. Even with powerful GPUs, processing large-scale software becomes time-consuming. To address these issues, we present a framework called ReGraph that efficiently compares binary code functions across architectures and optimization levels. Our evaluation on public datasets shows that ReGraph has a significant speed advantage, running 700 times faster than Natural Language Processing (NLP)-based methods while maintaining accuracy comparable to state-of-the-art models.

Paper Structure

This paper contains 15 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The assembly code of example functions compiled in ARM -O3 (left) and X86 -O0 (right).
  • Figure 2: The structure of ReGraph.