Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary

Xiangzhe Xu, Zhuo Zhang, Zian Su, Ziyang Huang, Shiwei Feng, Yapeng Ye, Nan Jiang, Danning Xie, Siyuan Cheng, Lin Tan, Xiangyu Zhang

TL;DR

GenNm tackles the challenge of recovering meaningful variable names from fully stripped binaries by treating it as a generative, context-aware task rather than a closed-vocabulary classification problem. It fine-tunes pre-trained code-language models on decompiled code with local and contextual inputs, and introduces Symbol Preference Optimization to align model outputs with developer naming preferences. Inference is performed iteratively along the program call graph, using context propagation and a name-validation step to ensure cross-function semantic consistency. Across two large datasets, GenNm yields consistent improvements over state-of-the-art baselines, including substantial gains when ground-truth names are unseen during training, and it outperforms black-box LLMs in precision and semantic relevance. The work demonstrates that combining context-aware fine-tuning, symbol-preference guidance, and iterative, graph-based context augmentation significantly enhances variable-name recovery, with clear implications for malware analysis and binary understanding.
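The iterative, call-graph-based inference described above can be illustrated with a minimal sketch. This is not GenNm's implementation: the call-graph representation, the `predict` callback (a stand-in for a model query), the `toy_predict` function, and the fixed number of rounds are all assumptions made for illustration, and the name-validation step is omitted.

```python
from typing import Callable, Dict, List, Set

# Hypothetical call graph: every function is a key, mapped to its callees.
CallGraph = Dict[str, List[str]]
# Predicted names per function: placeholder variable -> recovered name.
Names = Dict[str, Dict[str, str]]


def neighbors(graph: CallGraph, func: str) -> Set[str]:
    """Return the callers and callees of `func`."""
    callees = set(graph.get(func, []))
    callers = {f for f, cs in graph.items() if func in cs}
    return callees | callers


def iterative_inference(
    graph: CallGraph,
    predict: Callable[[str, Names], Dict[str, str]],
    rounds: int = 2,
) -> Names:
    """Re-query each function using the names predicted for its neighbors
    in the previous round (assumed scheme, for illustration only)."""
    names: Names = {f: {} for f in graph}
    for _ in range(rounds):
        updated: Names = {}
        for func in graph:
            # Context: names from callers/callees that already have predictions.
            context = {
                n: names[n]
                for n in sorted(neighbors(graph, func))
                if names.get(n)
            }
            updated[func] = predict(func, context)
        names = updated
    return names


def toy_predict(func: str, context: Names) -> Dict[str, str]:
    """Stand-in for a model query; a real system would prompt a fine-tuned model."""
    if func == "main":
        return {"v1": "buf_size"}
    if func == "copy_buf" and context:
        # Only guesses a meaningful name once neighbor names are visible.
        return {"v1": "len"}
    return {"v1": "v1"}


graph = {"main": ["copy_buf"], "copy_buf": []}
result = iterative_inference(graph, toy_predict, rounds=2)
```

In this toy run, `copy_buf` keeps its placeholder name in the first round and only receives a meaningful name in the second, once the prediction for its caller `main` is available as context.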

Abstract

Decompilation aims to recover the source-code form of a binary executable. It has many security applications, such as malware analysis, vulnerability detection, and code hardening. A prominent challenge in decompilation is to recover variable names. We propose a novel technique that leverages the strengths of generative models while mitigating model biases. We build a prototype, GenNm, from the pre-trained generative models CodeGemma-2B, CodeLlama-7B, and CodeLlama-34B. We fine-tune GenNm on decompiled functions and teach the models to leverage contextual information. GenNm includes names from callers and callees when querying a function, providing rich contextual information within the model's input token limit. We mitigate model biases by aligning the output distribution of the models with the symbol preferences of developers. Our results show that GenNm improves the state-of-the-art name recovery precision by 5.6-11.4 percentage points on two commonly used datasets, and improves the state of the art by a relative 32% (from 17.3% to 22.8%) in the most challenging setup, where ground-truth variable names are not seen in the training dataset.
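The abstract reports gains both in percentage points and as a percentage, which are easy to conflate. The short sketch below, using hypothetical helper names, works through the arithmetic for the unseen-name setup (17.3% to 22.8%).

```python
def absolute_gain_pp(baseline_pct: float, new_pct: float) -> float:
    """Absolute improvement, in percentage points."""
    return new_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, new_pct: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new_pct - baseline_pct) / baseline_pct


# Figures quoted in the abstract for the unseen-name setup.
abs_gain = absolute_gain_pp(17.3, 22.8)   # roughly 5.5 percentage points
rel_gain = relative_gain_pct(17.3, 22.8)  # roughly 32 percent
```

The 32% figure is therefore a relative improvement; the corresponding absolute gain is about 5.5 percentage points.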

Paper Structure (50 sections, 11 equations, 20 figures, 10 tables, 1 algorithm)

Figures (20)

  • Figure 1: Code snippets for the motivating example. Corresponding variables are highlighted in the same color.
  • Figure 2: Name selections for the baseline (VarBERT) and name distributions for the predictions of GenNm. Each column shows the predictions of one technique. VarBERT denotes the baseline model; GenNm-SymPO denotes the GenNm model after fine-tuning and symbol preference optimization. +Context denotes that the model is used with the contextual information propagated along the call graph. Blue, pink, and yellow denote predictions for v29, s, and n, respectively. Names are ranked by probability, where a longer bar denotes a higher probability. Names in bold are similar or equal to the ground-truth names. Outlined names are those selected by the name validation algorithm.
  • Figure 3: Distribution of name frequencies. More than 50% of variable names (in orange) appear only once in the training dataset.
  • Figure 4: Query prompt to GenNm, augmented with the information propagated from the calling context (the green box). Dataflows used in name validation are indicated by green arrows, with the most relevant ones highlighted.
  • Figure 5: Formal definitions of the problem.
  • ...and 15 more figures