ReGraph: A Tool for Binary Similarity Identification
Li Zhou, Marc Dacier, Charalambos Konstantinou
TL;DR
ReGraph tackles binary code similarity across architectures and optimization levels by combining a lifter (RetDec), a re-optimizer (LLVM -O3), Code Property Graphs, and a Graph Neural Network to produce function-level embeddings. Similarity is computed as the Pearson correlation between function vectors, which enables fast top-K matching and reduces analyst workload. The approach is over $700\times$ faster than NLP-based methods while achieving accuracy comparable to state-of-the-art BCSD tools, and it ships as a lightweight, open-source implementation. The work also highlights lifting and re-optimization as a general meta-method for improving BCSD pipelines and supporting large-scale binary analysis.
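The matching step described above can be illustrated with a minimal sketch. It assumes function embeddings are already available as plain lists of floats. The names `pearson` and `top_k`, the candidate labels, and the example vectors are hypothetical and are not taken from the ReGraph code base; in ReGraph the vectors would come from the Graph Neural Network.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length numeric vectors."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    # A constant vector has zero variance; treat it as uncorrelated.
    return numerator / denominator if denominator else 0.0


def top_k(query, candidates, k=3):
    """Return the k (name, score) pairs most correlated with the query."""
    scored = [(name, pearson(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical 4-dimensional embeddings, for illustration only.
    candidates = {
        "memcpy_arm_O0": [0.9, 0.1, 0.4, 0.7],
        "memcpy_x86_O3": [0.8, 0.2, 0.5, 0.6],
        "strlen_x86_O2": [0.1, 0.9, 0.6, 0.2],
    }
    query = [0.85, 0.15, 0.45, 0.65]
    for name, score in top_k(query, candidates, k=2):
        print(f"{name}: {score:.4f}")
```

Because each comparison is a single pass over two fixed-size vectors, ranking a query against a large function database is cheap once the embeddings have been computed. An analyst then inspects only the top-K candidates instead of the whole database.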
Abstract
Binary Code Similarity Detection (BCSD) is essential not only for security tasks such as vulnerability identification but also for code copying detection, yet it remains challenging due to binary stripping and diverse compilation environments. Existing methods tend to adopt increasingly complex neural networks to improve accuracy, and computation time grows with that complexity: even with powerful GPUs, processing large-scale software becomes time-consuming. To address these issues, we present ReGraph, a framework that efficiently compares binary code functions across architectures and optimization levels. Our evaluation on public datasets shows that ReGraph has a significant speed advantage, running 700 times faster than Natural Language Processing (NLP)-based methods while maintaining accuracy comparable to state-of-the-art models.
