Understanding the AI-powered Binary Code Similarity Detection

Lirong Fu, Peiyu Liu, Wenlong Meng, Kangjie Lu, Shize Zhou, Xuhong Zhang, Wenzhi Chen, Shouling Ji

TL;DR

This paper tackles the challenge of fairly evaluating AI-powered BinSD methods amid heterogeneous embedding strategies and evaluation setups. It conducts a systematic, static, function-level study across similar-function detection and two downstream tasks (vulnerability search and license violation detection), re-implementing and harmonizing a set of representative approaches on two public datasets. The analysis reveals that GNN-based embeddings achieve strong performance in similar-function detection but suffer from embedding collision, while downstream tasks show varying suitability across models; ROC/AUC alone often fails to reflect practical effectiveness. The study also highlights methodological sensitivities, such as repository construction and function renaming, and proposes concrete directions (embedding concatenation, graph alignment) to address current limitations, with open-source artifacts to accelerate future BinSD research.

Abstract

AI-powered binary code similarity detection (BinSD), which transforms intricate binary code comparison into a distance measure over code embeddings produced by neural networks, has been widely applied to program analysis. However, due to the diversity of the adopted embedding strategies, evaluation methodologies, running environments, and benchmarks, it is difficult to quantitatively understand to what extent the BinSD problem has been solved, especially in real-world applications. Moreover, the lack of an in-depth investigation of the increasingly complex embedding neural networks and the various evaluation methodologies has become a key factor hindering the development of AI-powered BinSD. To fill these research gaps, we present a systematic evaluation of state-of-the-art AI-powered BinSD approaches by conducting a comprehensive comparison of BinSD systems on similar function detection and two downstream applications, namely vulnerability search and license violation detection. Building upon this evaluation, we perform the first investigation of embedding neural networks and evaluation methodologies. The experimental results yield several findings that provide valuable insights into the BinSD domain: (1) although GNN-based BinSD systems currently achieve the best performance in similar function detection, there is still considerable room for improvement; (2) the capability of AI-powered BinSD approaches varies significantly across downstream applications; (3) existing evaluation methodologies still need substantial adjustments. For instance, evaluation metrics such as the widely adopted ROC and AUC usually fall short of accurately representing model performance in practical, real-world use. Based on the extensive experiments and analysis, we further provide several promising future research directions.
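
The gap between pairwise AUC and practical search quality can be illustrated with a toy sketch. It is not taken from the paper: the 2-D embeddings, the function names (`true_match`, `collider`, and so on), and the helper functions are all hypothetical, chosen only to show how a model can separate labeled pairs perfectly while a near-colliding embedding in the search repository still outranks the true match.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top_k(query, repository, k):
    """Names of the k repository functions most similar to the query embedding."""
    ranked = sorted(
        repository.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings (assumed values, for illustration only).
query = [1.0, 0.0]
repository = {
    "true_match": [0.9, 0.3],    # same source function, different compilation
    "collider": [1.0, 0.05],     # unrelated function with a near-identical embedding
    "unrelated_a": [0.0, 1.0],
    "unrelated_b": [-1.0, 0.2],
}

# Pairwise evaluation: the labeled test pairs do not include the collider.
pair_names = ["true_match", "unrelated_a", "unrelated_b"]
pair_labels = [1, 0, 0]
pair_scores = [cosine_similarity(query, repository[n]) for n in pair_names]
auc = pairwise_auc(pair_scores, pair_labels)

# Search evaluation: rank the whole repository against the query.
top1 = top_k(query, repository, 1)
top2 = top_k(query, repository, 2)

print("AUC on labeled pairs:", auc)
print("top-1 search result:", top1)
print("top-2 search result:", top2)
```

In this toy setting the AUC over the labeled pairs is perfect, yet the top-1 search result is the colliding function rather than the true match, which is the kind of mismatch that ranking-based evaluation over a full repository is meant to expose.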

Paper Structure

This paper contains 21 sections, 3 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: We vary how the search repository is built by including different ratios of query functions in it. (a) and (b) show how the metric values change under different ratios.
  • Figure 2: We perform AUC calculation on the test dataset. Each point in (a), (b), and (c) represents the similarity score of one function pair in the test dataset. (d), (e), and (f) show the top-10 function search results, where each query function is the first function of a function pair stored in the test dataset. Each point in (d), (e), and (f) represents the similarity score between the query function and one of the functions in its top-10 search result.