BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis

Fei Zuo, Cody Tompkins, Qiang Zeng, Lannan Luo, Yung Ryn Choe, Junghwan Rhee

TL;DR

BinSimDB is a benchmark dataset for fine-grained binary code similarity analysis that contains equivalent pairs of small binary code snippets, such as basic blocks. The BMerge and BPair algorithms are proposed to bridge the discrepancies between two binary code snippets caused by different optimization levels or platforms.

Abstract

Binary Code Similarity Analysis (BCSA) has a wide spectrum of applications, including plagiarism detection, vulnerability discovery, and malware analysis, and has therefore drawn significant attention from the security community. However, conventional techniques often struggle to balance accuracy and scalability. To overcome these problems, a surge of deep learning-based work has recently been proposed. Unfortunately, many researchers still find it extremely difficult to conduct relevant studies or extend existing approaches. First, prior work typically relies on proprietary benchmarks without making the entire dataset publicly accessible. Consequently, large-scale, well-labeled datasets for binary code similarity analysis remain scarce. Second, previous work has primarily focused on comparison at the function level rather than at finer granularities. We therefore argue that the lack of a fine-grained dataset for BCSA leaves a critical gap in current research. To address these challenges, we construct BinSimDB, a benchmark dataset for fine-grained binary code similarity analysis that contains equivalent pairs of smaller binary code snippets, such as basic blocks. Specifically, we propose the BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets caused by different optimization levels or platforms. Furthermore, we empirically study the properties of our dataset and evaluate its effectiveness for BCSA research. The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
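
To make the notion of "equivalent pairs of basic blocks" concrete, the following is a minimal sketch under stated assumptions, not the paper's BMerge or BPair algorithm. It assumes each basic block has already been annotated with the set of source file and line numbers it was compiled from, and it pairs blocks from two builds only when those sets are identical. The `Block` type, the `candidate_pairs` helper, the block names, and the `gzip.c` line numbers are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    name: str          # hypothetical label for a basic block
    lines: frozenset   # set of (source_file, line_number) annotations


def candidate_pairs(blocks_a, blocks_b):
    """Pair blocks whose source-line annotation sets are identical."""
    # Index the second build by annotation set for constant-time lookup.
    index = defaultdict(list)
    for block in blocks_b:
        index[block.lines].append(block)

    pairs = []
    for block in blocks_a:
        for match in index.get(block.lines, []):
            pairs.append((block.name, match.name))
    return pairs


# Hypothetical annotations for the same source compiled two ways.
build_o0 = [
    Block("x86_O0_bb1", frozenset({("gzip.c", 10), ("gzip.c", 11)})),
    Block("x86_O0_bb2", frozenset({("gzip.c", 12)})),
]
build_o3 = [
    Block("arm_O3_bb1", frozenset({("gzip.c", 10), ("gzip.c", 11)})),
    Block("arm_O3_bb2", frozenset({("gzip.c", 12), ("gzip.c", 13)})),
]

pairs = candidate_pairs(build_o0, build_o3)
print(pairs)
```

In this toy input only the first block of each build matches. The second blocks cover different line sets, which is the kind of discrepancy between optimization levels or platforms that exact matching alone cannot resolve.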

Paper Structure

This paper contains 19 sections, 3 equations, 11 figures, 3 tables, and 2 algorithms.

Figures (11)

  • Figure 1: A patch in gzip for CVE-2010-0001.
  • Figure 2: A pair of equivalent binary code snippets from the same source code.
  • Figure 3: Basic block annotation with corresponding source file and line numbers.
  • Figure 4: Binary code in AArch64 obtained by using O3 as the optimization level.
  • Figure 5: System overview.
  • ...and 6 more figures