
A Progressive Transformer for Unifying Binary Code Embedding and Knowledge Transfer

Hanxiao Lu, Hongyu Cai, Yiming Liang, Antonio Bianchi, Z. Berkay Celik

TL;DR

ProTST tackles binary code understanding by replacing monolithic two-stage MLM pre-training with a progressive teacher-student hierarchy that transfers knowledge from foundational to advanced binary tasks. It combines an embedding module, a RoBERTa-based backbone, and task-specific heads to learn rich, task-appropriate representations directly from raw bytes, enabling seven diverse binary analyses. Empirical results show consistent improvements over traditional and multimodal baselines, plus faster convergence and robustness to optimization levels and obfuscation. The approach reduces reliance on architecture-specific feature engineering and opens avenues for extending to assembly input and additional tasks, with public release of the model and framework.

Abstract

Language model approaches have recently been integrated into binary analysis tasks, such as function similarity detection and function signature recovery. These models typically employ a two-stage training process: pre-training via Masked Language Modeling (MLM) on machine code and fine-tuning for specific tasks. While MLM helps to understand binary code structures, it ignores essential code characteristics, including control and data flow, which negatively affects model generalization. Recent work leverages domain-specific features (e.g., control flow graphs and dynamic execution traces) in transformer-based approaches to improve binary code semantic understanding. However, this approach involves complex feature engineering, a cumbersome and time-consuming process that can introduce predictive uncertainty when dealing with stripped or obfuscated code, leading to a performance drop. In this paper, we introduce ProTST, a novel transformer-based methodology for binary code embedding. ProTST employs a hierarchical training process based on a unique tree-like structure, where knowledge progressively flows from fundamental tasks at the root to more specialized tasks at the leaves. This progressive teacher-student paradigm allows the model to build upon previously learned knowledge, resulting in high-quality embeddings that can be effectively leveraged for diverse downstream binary analysis tasks. The effectiveness of ProTST is evaluated on seven binary analysis tasks, and the results show that ProTST yields an average validation score (F1, MRR, and Recall@1) improvement of 14.8% compared to traditional two-stage training and an average improvement of 10.7% compared to multimodal two-stage frameworks.
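
The tree-structured, teacher-to-student flow of knowledge described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the task names (`root_task`, `mid_task`, `leaf_a`, `leaf_b`), the `TaskNode` class, and the `toy_train` function are hypothetical placeholders, and a plain dictionary stands in for real transformer weights. The only idea it tries to capture is that each task starts from the weights its parent task finished with.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stand-in for real model parameters (assumption: a dict is enough to illustrate).
Weights = Dict[str, float]


@dataclass
class TaskNode:
    """One task in a hypothetical task tree."""
    name: str
    children: List["TaskNode"] = field(default_factory=list)


def progressive_train(
    root: TaskNode,
    train_fn: Callable[[str, Weights], Weights],
    init_weights: Weights,
) -> Dict[str, Weights]:
    """Train tasks from root to leaves; each child starts from its parent's weights."""
    trained: Dict[str, Weights] = {}
    queue = deque([(root, init_weights)])
    while queue:
        node, parent_weights = queue.popleft()
        # Copy so that sibling tasks do not overwrite each other's starting point.
        weights = train_fn(node.name, dict(parent_weights))
        trained[node.name] = weights
        for child in node.children:
            queue.append((child, weights))
    return trained


def toy_train(task: str, weights: Weights) -> Weights:
    """Hypothetical training step: records that this task has shaped the weights."""
    weights[task] = 1.0
    return weights


# Hypothetical three-level tree: one root, one intermediate task, two leaves.
tree = TaskNode("root_task", [TaskNode("mid_task", [TaskNode("leaf_a"), TaskNode("leaf_b")])])
trained = progressive_train(tree, toy_train, {})
print(sorted(trained["leaf_a"]))  # ['leaf_a', 'mid_task', 'root_task']
```

In this toy run, each leaf carries the contribution of every ancestor task but not of its sibling, which mirrors the root-to-leaf direction of knowledge transfer.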

Paper Structure

This paper contains 26 sections, 23 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: (a) Traditional two-stage training on static code. (b) Two-stage training with high-level modalities (e.g., CFG, DFG, and execution traces).
  • Figure 2: The progressive teacher-student learning of ProTST. Tasks are hierarchically structured to leverage foundational knowledge for more complex tasks. Each task employs a transformer, with model weights serving as interfaces between adjacent nodes. The system operates solely on the raw byte sequence (address and assembly are shown for illustration only).
  • Figure 3: The model architecture of ProTST.
  • Figure 4: The performance (Recall@1) of different models for binary code similarity detection with respect to pool size.
  • Figure 5: Comparative performance of ProTST and XDA on various binary analysis tasks across different optimization levels.
  • ...and 4 more figures
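
For readers unfamiliar with the retrieval metrics named in the abstract and in Figure 4, the following sketch shows how Recall@1 and MRR are commonly computed from the rank of the correct match within a candidate pool. The function names and the example ranks are illustrative assumptions, not values or code from the paper.

```python
from typing import Sequence


def recall_at_1(ranks: Sequence[int]) -> float:
    """Fraction of queries whose correct match is ranked first (ranks are 1-based)."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank of the correct match over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the true match for four query functions.
ranks = [1, 2, 1, 4]
print(recall_at_1(ranks))           # 0.5
print(mean_reciprocal_rank(ranks))  # 0.6875
```

A larger pool gives the correct match more competitors, so ranks tend to get worse as pool size grows, which is why similarity-detection results are usually reported per pool size.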