Identifying Non-Control Security-Critical Data through Program Dependence Learning

Zhilong Wang, Haizhou Wang, Hong Hu, Peng Liu

TL;DR

This work targets data-oriented attacks by automatically identifying security-critical non-control data. It combines concrete dynamic execution with customized program-dependence graphs and a Tree-LSTM architecture to learn long-range semantic patterns that distinguish critical from non-critical variables. The approach achieves about 0.91 F1 on a 6,000-sample PDG dataset and rediscovers 8 of 9 known variables; it also identifies 80 candidates in Google FuzzBench and enables 7 simulated PoCs in GDB. These results indicate that critical-variable discovery can be automated at scale across real-world programs and diverse variable types, reducing the manual effort needed for defense and assessment. The method is of practical use for preemptive hardening and for evaluating data-oriented attack surfaces.

Abstract

As control-flow protection becomes widely deployed, it is increasingly difficult for attackers to corrupt control data and achieve control-flow hijacking. Instead, data-oriented attacks, which manipulate non-control data, have been demonstrated to be feasible and powerful. A fundamental step in data-oriented attacks is to identify non-control, security-critical data. However, the identification processes in previous works do not scale, because they rely mainly on tedious human effort. To address this issue, we propose a novel approach that combines traditional program analysis with deep learning. At a high level, by examining how analysts identify critical data, we first propose dynamic analysis algorithms to identify the program semantics (and features) that are correlated with the impact of critical data. Then, motivated by the unique challenges of the critical data identification task, we formalize the distinguishing features and use customized program dependence graphs (PDGs) to embed the features. Unlike previous works that use deep learning to learn basic program semantics, this paper adopts a special neural network architecture that can capture the long dependency paths (in the PDG) through which a critical variable propagates its impact. We have implemented a fully automatic toolchain and conducted comprehensive evaluations. According to the evaluations, our model achieves 90% accuracy. The toolchain uncovers 80 potential critical variables in Google FuzzBench. In addition, we demonstrate the harmfulness of exploits that use the identified critical variables by simulating 7 data-oriented attacks through GDB.
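
To make the notion of a "long dependency path" concrete, here is a minimal sketch of how a dependence graph rooted at one variable could be unfolded into a tree, the input shape a tree-structured network expects. It is an illustration under assumptions, not the authors' toolchain: the function names, the edge format, the depth cap, and the example node labels are all hypothetical.

```python
from collections import defaultdict


def build_dependence_graph(edges):
    """Build an adjacency list from (src, dst) pairs, where dst depends on src.

    The edge format is an assumption made for this sketch.
    """
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    return graph


def unfold_to_tree(graph, root, max_depth=10, _path=()):
    """Unfold the nodes reachable from root into a nested (node, children) tuple.

    Nodes already on the current path are skipped so cycles terminate,
    and max_depth (a hypothetical cap) bounds the size of the tree.
    """
    if max_depth == 0:
        return (root, [])
    path = _path + (root,)
    children = [
        unfold_to_tree(graph, nxt, max_depth - 1, path)
        for nxt in graph.get(root, [])
        if nxt not in path
    ]
    return (root, children)


def tree_depth(tree):
    """Number of nodes on the longest root-to-leaf path."""
    _, children = tree
    return 1 + max((tree_depth(child) for child in children), default=0)


# Hypothetical dependence edges starting at a flag-like variable.
example_edges = [
    ("authenticated", "if_auth"),
    ("if_auth", "grant_session"),
    ("authenticated", "log_result"),
]
example_tree = unfold_to_tree(build_dependence_graph(example_edges), "authenticated")
```

In this toy example the longest path from `authenticated` passes through `if_auth` to `grant_session`, so `tree_depth(example_tree)` is 3. A learned model would consume the node features along such paths rather than only their length.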

Paper Structure

This paper contains 26 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Different categories of data in program space.
  • Figure 2: Code pieces dependent on two variables, DefaultRoot and SystemLog, from ProFTPD config files. DefaultRoot restricts users to only certain directories; SystemLog specifies the path to output the logs.
  • Figure 3: Two loop control variables. authenticated in OpenSSH is an authentication flag; res in ProFTPD denotes whether a file write succeeded.
  • Figure 4: Approach overview.
  • Figure 5: Dependence graphs and tree built from the execution trace of \ref{code:bit} for variable aclp.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1: Security-critical Data
  • Definition 2: Customized DDG
  • Definition 3: Customized CDG