StrTune: Data Dependence-based Code Slicing for Binary Similarity Detection with Fine-tuned Representation

Kaiyan He, Yikun Hu, Xuehui Li, Yunhao Song, Yubo Zhao, Dawu Gu

TL;DR

StrTune slices binary code based on data dependence, performs slice-level fine-tuning, and uses a Siamese network to fine-tune on pairs of slices that compute the same value but differ in syntax, making their representations closer in the feature space.

Abstract

Binary Code Similarity Detection (BCSD) is important for software security because it supports binary analysis tasks such as identifying malicious code snippets and analyzing binary patches by comparing code patterns. Recently, artificial intelligence-based approaches to BCSD have received growing attention because of their scalability and generalization. However, because binaries are compiled with different compilation configurations, existing approaches still face notable limitations when comparing binaries. First, BCSD requires analysis of code behavior; existing work claims to extract semantics but in practice still analyzes code at the level of syntax. Second, because it extracts features directly from assembly sequences, existing work cannot handle the instruction reordering and differing syntactic expressions caused by various compilation configurations. In this paper, we propose StrTune, which slices binary code based on data dependence and performs slice-level fine-tuning. To address the first limitation, StrTune performs backward slicing based on data dependence to capture how a value is computed along the execution. Each slice reflects the collecting semantics of the code, which is stable across different compilation configurations. StrTune introduces flow types to emphasize the independence of computations between slices, forming a graph representation. To overcome the second limitation, StrTune takes pairs of slices that correspond to the same value computation but have different syntactic representations and uses a Siamese network to fine-tune on such pairs, making their representations closer in the feature space.
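As an illustration of the backward slicing idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical toy instruction format (`Instr`, with one destination and a tuple of sources) and a single straight-line basic block; the paper itself works on IR lifted from binaries. The sketch shows why a data-dependence slice can stay the same when a compiler reorders independent instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Toy three-address instruction: dest = op(srcs). Hypothetical format."""
    dest: str
    op: str
    srcs: tuple


def backward_slice(block, target_index):
    """Return sorted indices of the instructions the target's value depends on.

    Walks the block backwards from the target and tracks the set of
    variables whose defining instruction has not been found yet.
    """
    needed = set(block[target_index].srcs)
    slice_indices = [target_index]
    for i in range(target_index - 1, -1, -1):
        instr = block[i]
        if instr.dest in needed:
            # This instruction defines a needed value: keep it and
            # replace the requirement with its own operands.
            needed.discard(instr.dest)
            needed.update(instr.srcs)
            slice_indices.append(i)
    return sorted(slice_indices)


# Two orderings of the same block, as different optimization levels might emit.
block_a = [
    Instr("t1", "load", ("p",)),
    Instr("t2", "load", ("q",)),
    Instr("t3", "add", ("t1", "c1")),
    Instr("t4", "mul", ("t2", "c2")),
    Instr("r", "mov", ("t3",)),
]
block_b = [block_a[1], block_a[0], block_a[3], block_a[2], block_a[4]]

# Slice on the final instruction, which computes `r`.
slice_a = [block_a[i] for i in backward_slice(block_a, 4)]
slice_b = [block_b[i] for i in backward_slice(block_b, 4)]
```

In this toy case the two blocks list their instructions in different orders, yet the slices for `r` contain the same instructions in the same relative order, because the unrelated computation of `t4` is excluded from both.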

Paper Structure

This paper contains 24 sections, 8 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: ① shows instructions of the function uninit_options with its corresponding computational contents in ②. ③ shows the general method that current approaches use to process the instruction token. ④ involves the prologue of the function uninit_options.
  • Figure 2: IR of the uninit_options function compiled with O0 and O3, and the way we slice the instructions, with each slice matching the computational contents shown in the same color.
  • Figure 3: Example of our graph representation, consisting of four types of flow: ① Sequential flow, ② Data parallel flow, ③ Data dependence flow and ④ Jump flow.
  • Figure 4: The overview of StrTune, consisting of three components: ① Graph Construction, which builds novel graphs based on IR lifted from binaries. ② Node Representation, which forms a two-step learning process for similar slices, including pre-training and fine-tuning of a RoBERTa model. ③ Structure Training, which performs labeled learning for graph pairs based on the node representations given by RoBERTa, utilizing attention coefficients across graphs. StrTune forms a binary similarity model which takes two functions as input and outputs a similarity score.
  • Figure 5: Graph Construction of our model StrTune: ① shows unprocessed Microcode in a basic block. After removal and preservation, we obtain the remaining instructions in ②, where instructions of the same color exhibit data dependence and are considered one code slice. Then we add specific flow types between slices based on the rules mentioned before, as shown in ③.
  • ...and 6 more figures
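
The abstract and the Figure 4 caption describe Siamese fine-tuning that pulls the representations of equivalent slices closer together. The sketch below shows one common objective for that kind of training, a cosine embedding loss. The choice of loss, the `margin` parameter, and the toy vectors are assumptions for illustration; the text above does not state which loss StrTune actually uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_embedding_loss(u, v, label, margin=0.0):
    """Pairwise loss over two embeddings from the same shared encoder.

    label == 1  : the slices compute the same value, so reward high similarity.
    label == -1 : the slices are unrelated, so penalize similarity above margin.
    """
    sim = cosine_similarity(u, v)
    if label == 1:
        return 1.0 - sim
    return max(0.0, sim - margin)


# Hypothetical embeddings of two slices produced by the same encoder.
same_value_pair = ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
unrelated_pair = ([0.9, 0.1, 0.0], [0.0, 0.1, 0.9])

loss_similar = cosine_embedding_loss(*same_value_pair, label=1)
loss_dissimilar = cosine_embedding_loss(*unrelated_pair, label=-1)
```

Minimizing this loss over labeled pairs moves embeddings of matching slices toward each other and leaves unrelated slices apart, which is the effect the abstract attributes to the fine-tuning step.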