Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code

Gangyang Li, Xiuwei Shang, Shaoyin Cheng, Junqi Zhang, Li Hu, Xu Zhu, Weiming Zhang, Nenghai Yu

TL;DR

This paper addresses binary variable type recovery, a core step in reverse engineering, by revealing that real-world binary types follow distributional patterns and propagate across functions. It proposes ByteTR, a two-stage framework combining BytePA inter-procedural program analysis with ByteTP, a GGNN-based type predictor operating on a Variable Semantic Graph derived from a Variable Propagation Graph. Grounded in an extensive empirical study on the TYDA dataset, ByteTR decouples target types, traces cross-function propagation, and uses static analysis to cope with compiler optimizations, achieving state-of-the-art average precision (76.18%) and strong performance across architectures and optimization levels. Real-world CTF cases demonstrate improved readability over leading tools, underscoring ByteTR’s practical impact for binary analysis and reverse engineering.

Abstract

Type recovery is a crucial step in binary code analysis, with significant importance for reverse engineering and various security applications. Existing works simply target type identifiers within binary code and typically recover types by analyzing variable characteristics within individual functions. However, we find that the types in real-world binary programs are more complex and often follow specific distribution patterns. In this paper, to gain a deeper understanding of the variable type recovery problem in binary code, we first conduct a comprehensive empirical study. We use the TYDA dataset, which includes 163,643 binary programs across four architectures and four compiler optimization options, reflecting the complexity and diversity of real-world programs. We study the patterns that characterize types and variables in binary code and investigate how compiler optimizations affect them, yielding many valuable insights. Based on our empirical findings, we propose ByteTR, a framework for recovering variable types in binary code. We decouple the target type set to address the unbalanced type distribution and perform static program analysis to counter the impact of compiler optimizations on variable storage. Because our study shows that variable propagation across functions is ubiquitous, ByteTR conducts inter-procedural analysis to trace variable propagation and employs a gated graph neural network to capture long-range data flow dependencies for variable type recovery. We conduct extensive experiments to evaluate the performance of ByteTR. The results demonstrate that ByteTR outperforms state-of-the-art works in both effectiveness and efficiency. Moreover, in a real CTF challenge case, the pseudocode optimized by ByteTR is significantly more readable than that produced by the leading tools IDA and Ghidra.
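The idea of tracing a variable across function boundaries can be illustrated with a small sketch. This is not ByteTR's BytePA analysis, whose details are not given here; it only assumes a hypothetical mapping (`call_arg_edges`) from each function to the callees that receive the variable as an argument, and follows those edges transitively. The function names in the toy program are invented for illustration.

```python
from collections import deque


def functions_reached(call_arg_edges, start):
    """Return the set of functions a variable can reach from `start`.

    `call_arg_edges` is a hypothetical mapping: function name -> list of
    callees that receive the variable as a call argument.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        fn = queue.popleft()
        for callee in call_arg_edges.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Toy program (assumed): main passes a buffer to parse,
# which forwards it to check and log.
edges = {"main": ["parse"], "parse": ["check", "log"], "check": [], "log": []}
reached = functions_reached(edges, "main")
crossed = len(reached) - 1  # functions crossed beyond the defining one
```

In this toy case the variable defined in `main` crosses three further functions. A purely intra-procedural analysis would see only the uses inside `main`, which is the limitation the abstract attributes to existing works.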

Paper Structure

This paper contains 46 sections, 10 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Variable type data fitting Zipf's and Heaps' laws, where $p$ denotes statistical significance.
  • Figure 2: Statistical analysis of variable propagation. (a) shows the proportion of variables by the number of functions a single variable crosses as it propagates along the data flow. (b) shows the density distribution, within a single function, of the total number of variables and of the number of variables passed as arguments to other functions.
  • Figure 3: Variable storage patterns across different architectures and optimization options.
  • Figure 4: Overview of ByteTR.
  • Figure 5: The full set of predictable types presented in the form of the BNF paradigm in our formulation.
  • ...and 5 more figures
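
Figure 1 reports that variable type data fit Zipf's law. As a rough illustration of what such a fit measures, the sketch below estimates a Zipf exponent by least-squares regression of log frequency on log rank. This is one common way to check a Zipf fit and is an assumption here, not necessarily the procedure used in the paper; the input frequencies are synthetic, not taken from TYDA.

```python
import math


def zipf_exponent(frequencies):
    """Estimate s in freq ~ C / rank**s via a log-log least-squares fit.

    Expects at least two strictly positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var  # negated slope of the log-log line


# Synthetic counts that follow freq = 1000 / rank exactly (illustrative only).
synthetic = [1000.0 / rank for rank in range(1, 51)]
exponent = zipf_exponent(synthetic)
```

An exponent near 1 indicates the classic Zipf shape, where a few types dominate and most types are rare. That kind of imbalance is what the abstract's decoupling of the target type set is meant to address.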