
CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection

Hao Wang, Zeyu Gao, Chao Zhang, Mingyang Sun, Yuchen Zhou, Han Qiu, Xi Xiao

TL;DR

This work tackles large-scale binary code similarity detection (BCSD) by proposing CEBin, a cost-effective framework that fuses an enhanced embedding-based retrieval stage with a subsequent comparison-based scoring stage in a hierarchical inference pipeline. It introduces RECM, a reusable embedding cache mechanism, to enlarge the set of negative samples available during embedding fine-tuning, and pairs it with a separate comparison model trained with a triplet objective to produce high-precision final results. Evaluations on three BCSD datasets and a large-scale 1-day vulnerability benchmark show substantial improvements over SOTA baselines in cross-architecture, cross-compiler, and cross-optimization settings, as well as scalability to millions of candidate functions at competitive inference cost. The results indicate that CEBin can effectively identify similar (including vulnerable) functions in large software ecosystems, which makes it practical for software supply-chain security tasks and vulnerability discovery. The authors also publicly release their code and a large vulnerability benchmark to support future BCSD research.

Abstract

Binary code similarity detection (BCSD) is a fundamental technique for various applications. Many BCSD solutions have been proposed recently, most of which are embedding-based, but they have shown limited accuracy and efficiency, especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overheads. Specifically, CEBin utilizes a refined embedding-based approach to extract features of target code, which efficiently narrows down the scope of candidate similar code and boosts performance. Then, it utilizes a comparison-based approach that performs a pairwise comparison on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin is able to provide an effective and efficient solution for detecting similar code (including vulnerable code) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in the real world, we construct a large-scale vulnerability benchmark, offering the first precise evaluation scheme to assess BCSD methods on the 1-day vulnerability detection task. CEBin can identify the similar function among millions of candidate functions in just a few seconds and achieves a recall rate of $85.46\%$ on this more practical but challenging task, which is several orders of magnitude faster and $4.07\times$ better than the best SOTA baseline. Our code is available at https://github.com/Hustcw/CEBin.
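The retrieve-then-compare idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` (a bag-of-mnemonics count vector) stands in for the learned encoders, `toy_compare` (a `difflib` sequence ratio) stands in for the learned comparison model, and the brute-force cosine scan stands in for a real vector index. All function names here are hypothetical.

```python
import difflib
import math
from typing import Callable, List, Sequence

Function = List[str]          # a function as a list of instruction mnemonics (toy representation)
Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec: Vector, pool_vecs: List[Vector], k: int) -> List[int]:
    """Stage A (cheap): indices of the k pool vectors most similar to the query."""
    order = sorted(range(len(pool_vecs)),
                   key=lambda i: cosine(query_vec, pool_vecs[i]),
                   reverse=True)
    return order[:k]


def hierarchical_search(query: Function,
                        pool: List[Function],
                        embed: Callable[[Function], Vector],
                        compare: Callable[[Function, Function], float],
                        k: int) -> int:
    """Stage A narrows the pool to k candidates; stage B (expensive) rescores only those."""
    pool_vecs = [embed(f) for f in pool]      # in practice precomputed once and indexed
    candidates = retrieve_top_k(embed(query), pool_vecs, k)
    return max(candidates, key=lambda i: compare(query, pool[i]))


# --- Hypothetical stand-ins for the learned models -------------------------
VOCAB = ["mov", "add", "sub", "cmp", "jmp"]


def toy_embed(func: Function) -> Vector:
    """Order-insensitive embedding: mnemonic counts."""
    return [float(func.count(tok)) for tok in VOCAB]


def toy_compare(a: Function, b: Function) -> float:
    """Order-sensitive pairwise score in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()


pool = [
    ["jmp", "cmp", "add", "mov"],   # same mnemonics as the query, different order
    ["mov", "add", "cmp", "jmp"],   # exact match
    ["sub", "sub", "sub"],          # unrelated
]
query = ["mov", "add", "cmp", "jmp"]

embedding_only = retrieve_top_k(toy_embed(query), [toy_embed(f) for f in pool], 1)[0]
best = hierarchical_search(query, pool, toy_embed, toy_compare, k=2)
print(embedding_only, best)
```

In this toy setup the embedding alone cannot separate the first two pool entries, because both have the same mnemonic counts; the pairwise comparison, applied only to the two retrieved candidates, picks the exact match. The expensive scorer runs on k pairs rather than on the whole pool, which is the cost argument the abstract makes.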

Paper Structure (33 sections, 3 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: The embedding-based model (left) represents functions $x,y$ as embeddings and calculates their similarity with a similarity metric (e.g., cosine). The comparison-based model (right) takes a pair of functions and outputs their similarity.
  • Figure 2: The Workflow of CEBin.
  • Figure 3: Illustration of the fine-tuning phase of CEBin. In stage 1, semantically equivalent function pairs $(Q_i, R_i)$ are encoded with the query encoder and the reference encoder, respectively. The corresponding pairs $(Q_i, R_i)$ are treated as positive pairs. All other pairs $(Q_i, R_j)_{i \neq j}$, along with all pairs $(Q_i, R'_j)$ formed with previous reference functions held in the Reusable Embedding Cache, are treated as negative pairs. The InfoNCE loss is calculated from the positive pairs and this large set of negative pairs. The loss is back-propagated to update the query encoder, and a momentum update is applied to the reference encoder. In stage 2, pairs of functions are concatenated and fed into the model together. $(Q_i, R_i)$ is treated as a positive pair and $(Q_i, R_{i+1})$ as a negative pair. The triplet loss is then used to train the comparison model.
  • Figure 4: Illustration of inference with CEBin. In stage 1, a reference encoder encodes all functions to be compared into vectors. In stage 2, a vector index over the functions and their corresponding vectors is built with an approximate nearest neighbor (ANN) algorithm, so that the K most similar vectors can be retrieved for a given query vector. In stage 3, given a query function, the query encoder produces its embedding vector, and the top-K closest functions are retrieved from the function pool using the pre-built vector index. The K candidate functions, along with the query function, are then fed into the comparison model to perform the final selection.
  • Figure 5: The performance of different binary similarity detection methods on BinaryCorp. The x-axis is logarithmic and denotes the pool size.
  • ...and 5 more figures
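
The training objectives named in the Figure 3 caption can be illustrated with a small sketch. It is not the paper's code: embeddings are plain Python lists, the temperature (0.07), momentum coefficient (0.99), margin (0.5), and cache size are assumed values chosen for illustration, the cache is modeled as a fixed-size FIFO queue, and the margin-based form of the triplet loss over comparison scores is an assumption. All function names are hypothetical.

```python
import math
from collections import deque
from typing import Deque, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries: List[Vector], refs: List[Vector],
             cached_refs: Sequence[Vector], temperature: float = 0.07) -> float:
    """Mean InfoNCE loss over a batch.

    refs[i] is the positive for queries[i]; every other in-batch reference
    and every cached reference embedding acts as a negative.
    """
    keys = list(refs) + list(cached_refs)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def momentum_update(ref_params: List[float], query_params: List[float],
                    m: float = 0.99) -> List[float]:
    """Move reference-encoder parameters slowly toward the query encoder's."""
    return [m * r + (1.0 - m) * q for r, q in zip(ref_params, query_params)]


def triplet_loss(pos_score: float, neg_score: float, margin: float = 0.5) -> float:
    """Hinge on comparison scores: the positive pair should beat the negative by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


# Toy batch: unit-length 2-D embeddings, with queries equal to their positives.
queries = [[1.0, 0.0], [0.0, 1.0]]
refs = [[1.0, 0.0], [0.0, 1.0]]

# The cache keeps reference embeddings from earlier batches (oldest evicted first).
cache: Deque[Vector] = deque(maxlen=4)
loss_without_cache = info_nce(queries, refs, cache)

cache.extend([[0.6, 0.8], [0.8, 0.6]])        # embeddings left over from a previous batch
loss_with_cache = info_nce(queries, refs, cache)

cache.extend(refs)                            # current references become future negatives
print(loss_without_cache, loss_with_cache)
```

Reusing cached reference embeddings adds negatives without re-encoding them, so the denominator of the InfoNCE loss grows at little extra cost. The slowly moving reference encoder is what keeps older cached embeddings approximately comparable with fresh ones.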