CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection
Hao Wang, Zeyu Gao, Chao Zhang, Mingyang Sun, Yuchen Zhou, Han Qiu, Xi Xiao
TL;DR
This work tackles the challenge of large-scale binary code similarity detection by proposing CEBin, a cost-effective framework that fuses an enhanced embedding-based retrieval stage with a subsequent comparison-based scoring stage in a hierarchical inference pipeline. It introduces RECM, a reusable embedding cache mechanism, to enlarge the pool of negative samples during embedding fine-tuning, and pairs it with a separate comparison model trained via a triplet objective to achieve high-precision final results. Evaluations on three BCSD datasets and a large-scale 1-day vulnerability benchmark demonstrate substantial improvements over SOTA baselines in cross-architecture, cross-compiler, and cross-optimization settings, as well as strong scalability to millions of candidate functions with competitive inference costs. The results indicate that CEBin can effectively identify similar (including vulnerable) functions in large software ecosystems, enabling practical deployment for software supply-chain security tasks and vulnerability discovery. The authors also release their code and a large vulnerability benchmark to support future BCSD research.
Abstract
Binary code similarity detection (BCSD) is a fundamental technique for various applications. Many BCSD solutions have been proposed recently, most of which are embedding-based, but they have shown limited accuracy and efficiency, especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overhead. Specifically, CEBin utilizes a refined embedding-based approach to extract features of target code, which efficiently narrows down the scope of candidate similar code and boosts performance. Then, it utilizes a comparison-based approach that performs a pairwise comparison on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin is able to provide an effective and efficient solution for detecting similar code (including vulnerable code) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in the real world, we construct a large-scale benchmark of vulnerabilities, offering the first precise evaluation scheme to assess BCSD methods for the 1-day vulnerability detection task. CEBin can identify the similar function from millions of candidate functions in just a few seconds and achieves an impressive recall rate of $85.46\%$ on this more practical but challenging task, which is several orders of magnitude faster and $4.07\times$ better than the best SOTA baseline. Our code is available at https://github.com/Hustcw/CEBin.
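The two-stage idea described in the abstract (cheap embedding retrieval to shortlist candidates, then a costlier pairwise comparison on the shortlist only) can be illustrated with a minimal sketch. This is not the CEBin implementation: the hand-written vectors stand in for learned function embeddings, the Jaccard token overlap in `pairwise_score` is a hypothetical placeholder for the trained comparison model, and all function names and data below are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, pool, k):
    """Stage 1: rank the whole pool by embedding similarity, keep the top k."""
    ranked = sorted(pool, key=lambda f: cosine(query_vec, f["vec"]), reverse=True)
    return ranked[:k]

def pairwise_score(query_tokens, cand_tokens):
    """Stage 2 placeholder: Jaccard overlap of instruction tokens.
    A real system would call a learned comparison model here."""
    q, c = set(query_tokens), set(cand_tokens)
    union = q | c
    return len(q & c) / len(union) if union else 0.0

def search(query, pool, k=2):
    """Shortlist with embeddings, then rerank only the shortlist pairwise."""
    candidates = retrieve(query["vec"], pool, k)
    return sorted(
        candidates,
        key=lambda f: pairwise_score(query["tokens"], f["tokens"]),
        reverse=True,
    )

# Toy pool: vectors and tokens are invented for illustration only.
pool = [
    {"name": "memcpy_arm", "vec": [0.9, 0.1, 0.0], "tokens": ["ldr", "str", "subs", "bne"]},
    {"name": "strlen_x86", "vec": [0.8, 0.3, 0.1], "tokens": ["mov", "cmp", "jne", "inc"]},
    {"name": "sha1_round", "vec": [0.0, 0.2, 0.9], "tokens": ["rol", "xor", "add", "and"]},
]
query = {"vec": [0.9, 0.1, 0.0], "tokens": ["mov", "cmp", "jne", "ret"]}

shortlist = retrieve(query["vec"], pool, k=2)
results = search(query, pool, k=2)
print([f["name"] for f in shortlist])
print([f["name"] for f in results])
```

In this toy data the embedding stage ranks `memcpy_arm` first, while the pairwise stage promotes `strlen_x86`, which shows why a second, finer-grained comparison can change the final ordering. The pairwise scorer runs on only `k` candidates rather than the full pool, which is where the cost saving comes from.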
