ViC: Virtual Compiler Is All You Need For Assembly Code Search

Zeyu Gao; Hao Wang; Yuanda Wang; Chao Zhang

TL;DR

This work tackles the dataset scarcity and compilation complexity hindering assembly code search by training a large language model to emulate a general compiler, creating ViC. By compiling Ubuntu packages and generating a massive source-to-assembly dataset, ViC is fine-tuned to virtually emit assembly code and augment a cross-language assembly search dataset. Through contrastive learning with augmented data and a decoupled encoder architecture, the approach delivers substantial improvements over baselines in assembly code search, and demonstrates meaningful similarity to real compiler outputs in both sequence and semantic terms. The results suggest that virtual compilation can dramatically broaden the language coverage and practical utility of assembly code search for reverse engineering and security tasks, while highlighting considerations around data quality and ethical use.

Abstract

Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions using natural language within vast binary programs. Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. Using Ubuntu packages to compile a dataset of 20 billion tokens, we continue pre-training CodeLlama as a Virtual Compiler (ViC), capable of compiling source code in any language to assembly code. This approach allows for virtual compilation across a wide range of programming languages without the need for a real compiler, preserving semantic equivalence and expanding the possibilities for assembly code dataset construction. Furthermore, we use ViC to construct a sufficiently large dataset for assembly code search. With this extensive dataset, we achieve a substantial improvement in assembly code search performance, with our model surpassing the leading baseline by 26%.

Paper Structure (35 sections, 5 equations, 6 figures, 4 tables)

Introduction
Background and Related Works
Assembly Code Analysis
Assembly Code Modeling
Code Search
Overview
Virtual Compiler
Code Dataset Construction
Model Training
Code Search Contrastive Learning
Dataset
Model Architecture
Assembly Code Encoder Training
Evaluation
Evaluation Setup
...and 20 more sections

Figures (6)

Figure 1: C source code and the compiled assembly code for a bubble sort algorithm.
Figure 2: The workflow overview of using ViC for assembly code search.
Figure 3: Correlation between the number of tokens used in model training and the quality of generated assembly code, as evaluated by various metrics.
Figure 4: Comparison of assembly code from the (a) real compiler and (b) virtual compiler. Mismatches are highlighted in green (address differences), cyan (alternate operation expressions), yellow (register allocation variances), and grey (stack allocation discrepancies).
Figure 5: Example Golang function Covariance calculating the statistical covariance between two data sets, demonstrating input validation, mean calculation, and the sum of squares.
...and 1 more figure
