Pluvio: Assembly Clone Search for Out-of-domain Architectures and Libraries through Transfer Learning and Conditional Variational Information Bottleneck

Zhiwei Fu, Steven H. H. Ding, Furkan Alaca, Benjamin C. M. Fung, Philippe Charland

TL;DR

This paper proposes incorporating human common knowledge through large-scale pre-trained natural language models, in the form of transfer learning, into current learning-based approaches for assembly clone search, and addresses the sequence limit issue by proposing a reinforcement learning agent to remove unnecessary and redundant tokens.

Abstract

The practice of code reuse is crucial in software development for a faster and more efficient development lifecycle. In reality, however, code reuse practices lack proper control, resulting in issues such as vulnerability propagation and intellectual property infringements. Assembly clone search, a critical shift-right defence mechanism, has been effective in identifying vulnerable code resulting from reuse in released executables. Recent studies on assembly clone search demonstrate a trend towards using machine learning-based methods to match assembly code variants produced by different toolchains. However, these methods are limited to what they learn from a small number of toolchain variants used in training, rendering them inapplicable to unseen architectures and their corresponding compilation toolchain variants. This paper presents the first study on the problem of assembly clone search with unseen architectures and libraries. We propose incorporating human common knowledge through large-scale pre-trained natural language models, in the form of transfer learning, into current learning-based approaches for assembly clone search. Transfer learning can aid in addressing the limitations of the existing approaches, as it can bring in broader knowledge from human experts in assembly code. We further address the sequence limit issue by proposing a reinforcement learning agent to remove unnecessary and redundant tokens. Coupled with a new Variational Information Bottleneck learning strategy, the proposed system minimizes the reliance on potential indicators of architectures and optimization settings, for better generalization to unseen architectures. We simulate unseen-architecture clone search scenarios, and the experimental results show the effectiveness of the proposed approach against state-of-the-art solutions.

Paper Structure (18 sections, 19 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: When deploying assembly clone search engines, out-of-domain issues arise. This occurs when certain architectures and libraries are not accessible during the training process, which is limited by access to compilers and the difficulty of encompassing all potential architectures. The objective is to acquire knowledge from a specific set of architectures and libraries, and then apply it to previously unseen ones.
  • Figure 2: The Structural Overview of Pluvio. A pair of assembly instruction sequences $I_a$ and $I_b$ is fed into the MPNet tokenizer, which outputs the token ids ($T_a$ and $T_b$) and attention masks ($M_a$ and $M_b$) for both sequences. The Removal agent model then selects the optimal tokens by removing the noise tokens introduced by function inlining and compiler-injected code. Afterwards, based on the embeddings $E_a$ and $E_b$ created by the MPNet Embedder, a CVIB Encoder with conditions $l_o$ and $l_a$ generates the instruction encodings $Z_a$ and $Z_b$ by removing the nuisance information that stems from optimizations and architectures. Finally, the semantic search method is adopted to calculate the similarity score $s_{score}$ of the two encodings.
  • Figure 3: The Structural Illustration of the MPNet Model. The leftmost inputs are non-predicted tokens and the rightmost tokens are selected as predicted tokens. The masks in the middle are the mask tokens of the predicted tokens. All the tokens, along with their corresponding positions, are fed into the model. Using masked language modeling (MLM), permuted language modeling (PLM), and two-stream self-attention, the MPNet model extracts representations via a bidirectional model.
  • Figure 4: The Structural Illustration of the Removal Agent Model. The token ids and attention masks are fed into the Removal agent model, which consists of an embedding layer, a convolutional layer, and a softmax activation function. To remove the noise tokens that come from inlined functions and injected code in the instructions, the agent model selects the top-k tokens and adds them to the sentence features as the output.
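
The token selection described for Figure 4 and the final similarity step described for Figure 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the per-token logits are taken as given (in Pluvio they would come from the embedding and convolutional layers), the similarity is assumed to be cosine similarity, and the function names `softmax`, `select_top_k_tokens`, and `cosine_similarity` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of per-token logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()


def select_top_k_tokens(token_ids, attention_mask, scores, k):
    """Keep the k highest-scoring non-padding tokens in their original order.

    `scores` stands in for the agent's softmax output (an assumption here).
    """
    token_ids = np.asarray(token_ids)
    attention_mask = np.asarray(attention_mask)
    scores = np.asarray(scores, dtype=float)
    # Padding positions (mask == 0) must never be selected.
    masked = np.where(attention_mask == 1, scores, -np.inf)
    # Never ask for more tokens than there are real (non-padding) tokens.
    k = min(k, int(attention_mask.sum()))
    # Take the indices of the k largest scores, then sort the indices so the
    # surviving tokens keep the order they had in the instruction sequence.
    keep = np.sort(np.argsort(-masked, kind="stable")[:k])
    return token_ids[keep], attention_mask[keep]


def cosine_similarity(z_a, z_b):
    """Cosine similarity of two encodings (assumed similarity measure)."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    return float(z_a @ z_b / (np.linalg.norm(z_a) * np.linalg.norm(z_b)))


# Toy example: six positions, the last one is padding.
ids, mask = select_top_k_tokens(
    token_ids=[101, 7, 8, 9, 102, 0],
    attention_mask=[1, 1, 1, 1, 1, 0],
    scores=[0.3, 0.05, 0.4, 0.05, 0.2, 0.9],
    k=3,
)
print(ids.tolist(), mask.tolist())
```

Sorting the kept indices preserves the instruction order, which matters because the downstream embedder is position-aware. The padding position is excluded even though its toy score is the largest.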