Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning

Nan Jiang; Chengxiao Wang; Kevin Liu; Xiangzhe Xu; Lin Tan; Xiangyu Zhang; Petr Babkin

TL;DR

Nova tackles assembly-specific challenges in binary analysis by introducing hierarchical self-attention and two contrastive learning objectives, enabling robust generation and understanding of assembly code. Built on a decoder-based foundation and pre-trained on ~4.3 million assembly functions, Nova achieves state-of-the-art results on binary code decompilation (Pass@1/Pass@10 gains) and binary code similarity detection (Recall@1 improvements), with ablation analyses confirming the value of each component. The approach relies on a three-tier attention scheme that represents instruction semantics via [INST] tokens and uses contrastive losses to align assembly with its source and across optimizations. This work demonstrates significant practical impact for binary analysis tasks and opens avenues for extending assembly-focused foundation models to multi-language contexts.

Abstract

Binary code analysis is the foundation of crucial tasks in the security domain; thus, building effective binary analysis techniques is more important than ever. Although large language models (LLMs) have brought impressive improvements to source code tasks, they do not directly generalize to assembly code because of two challenges unique to assembly: (1) its low information density and (2) the diverse optimizations applied to it. To overcome these challenges, this work proposes a hierarchical attention mechanism that builds attention summaries to capture semantics more effectively, and designs contrastive learning objectives that train LLMs to learn assembly optimization. Equipped with these techniques, this work develops Nova, a generative LLM for assembly code. Nova outperforms existing techniques on binary code decompilation by up to 14.84 to 21.58 percentage points (absolute) in Pass@1 and Pass@10, and outperforms the latest binary code similarity detection techniques by up to 6.17% in Recall@1, showing promising abilities on both assembly generation and understanding tasks.
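The two metrics named in the abstract can be made concrete with a small sketch. It assumes Pass@k is computed with the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) over n sampled decompilations, of which c pass the tests, and that Recall@1 is the fraction of queries whose top-ranked candidate is the true match. The abstract does not state the exact formulas used, and the function names `pass_at_k` and `recall_at_1` are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def recall_at_1(similarity: list[list[float]], truth: list[int]) -> float:
    """Fraction of queries whose highest-scoring candidate is the true match.

    similarity[i][j] is a (hypothetical) score between query i and candidate j;
    truth[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for row, target in zip(similarity, truth):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == target)
    return hits / len(truth)
```

For instance, with 10 samples per function and 3 of them correct, `pass_at_k(10, 3, 1)` is the chance that a single randomly chosen sample passes.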

Paper Structure (30 sections, 4 equations, 11 figures, 7 tables)

Introduction
Related Work
Binary Models
Large Source-Code Models
Attention Mechanism
Approach
Data Collection
Hierarchical Self-Attention
Contrastive Learning
Task 1: Binary Code Decompilation
Task 2: Binary Code Similarity Detection
Experimental Setup
Pre-Training
Fine-Tuning for Binary Code Decompilation
Fine-Tuning for Binary Code Similarity Detection
...and 15 more sections

Figures (11)

Figure 1: Example that shows the semantics and diverse optimizations of assembly code.
Figure 2: Overview of developing Nova
Figure 3: Design of Nova's hierarchical attention for assembly code
Figure 4: Design of functionality and optimization contrastive learning (CL). "asm" denotes assembly.
Figure 5: t-SNE analysis of embeddings calculated by Nova$_{-CL-HA}$, Nova$_{-HA}$, and Nova.
...and 6 more figures
