STRIDE: Simple Type Recognition In Decompiled Executables

Harrison Green, Edward J. Schwartz, Claire Le Goues, Bogdan Vasilescu

TL;DR

STRIDE addresses the problem of inferring variable types and names in decompiled executables, where this source-level information was lost during compilation. It introduces a non-neural, N-gram-based post-processor that predicts labels by matching the decompiler tokens surrounding each variable use against usage signatures drawn from a training corpus. Across three benchmarks (DIRT, DIRE, VarCorpus), STRIDE achieves accuracy competitive with or superior to state-of-the-art transformer models while running substantially faster and on CPU only, aided by a compact implementation and a hashed N-gram database. The work presents a practical, scalable alternative to large models and discusses future directions that combine STRIDE’s strengths with advances in language-model techniques for decompilation tasks. The open-source release at https://github.com/hgarrereyn/STRIDE enables integration with decompilers in real-world reverse engineering workflows.

Abstract

Decompilers are widely used by security researchers and developers to reverse engineer executable code. While modern decompilers are adept at recovering instructions, control flow, and function boundaries, some useful information from the original source code, such as variable types and names, is lost during the compilation process. Our work aims to predict these variable types and names from the remaining information. We propose STRIDE, a lightweight technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data. We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming while being much simpler and faster. We perform a detailed comparison with two recent SOTA transformer-based models in order to understand the specific factors that make our technique effective. We implemented STRIDE in fewer than 1000 lines of Python and have open-sourced it under a permissive license at https://github.com/hgarrereyn/STRIDE.
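
To make the idea of "matching sequences of decompiler tokens to those found in training data" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a whitespace-tokenized decompiler output, a single `<VAR>` placeholder for normalizing variable tokens, and a fallback from larger to smaller context sizes. The names `contexts`, `train`, `predict`, and the toy corpus are all hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

VAR = "<VAR>"  # assumed placeholder that replaces every variable token


def contexts(tokens, variables, target, n):
    """Yield a short hash of the n tokens left and right of each use of `target`."""
    norm = [VAR if t in variables else t for t in tokens]
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        left = tuple(norm[max(0, i - n):i])
        right = tuple(norm[i + 1:i + 1 + n])
        key = repr((n, left, right)).encode()
        yield hashlib.sha256(key).hexdigest()[:12]


def train(corpus, sizes=(4, 2, 1)):
    """Build a hashed-context database: hash -> Counter of labels seen with it."""
    db = defaultdict(Counter)
    for tokens, labels in corpus:
        for var, label in labels.items():
            for n in sizes:
                for h in contexts(tokens, labels.keys(), var, n):
                    db[h][label] += 1
    return db


def predict(db, tokens, variables, target, sizes=(4, 2, 1)):
    """Try the largest context first; fall back to smaller ones if nothing matches."""
    for n in sizes:
        votes = Counter()
        for h in contexts(tokens, variables, target, n):
            votes.update(db.get(h, {}))
        if votes:
            return votes.most_common(1)[0][0]
    return None


# Toy training corpus (hypothetical): token stream plus ground-truth types.
corpus = [
    ("v1 -> size = 0 ; free ( v1 -> data ) ;".split(), {"v1": "dict_t *"}),
    ("v2 = v2 + 1 ; return v2 ;".split(), {"v2": "int"}),
]
db = train(corpus)

exact = predict(db, "a1 -> size = 0 ; free ( a1 -> data ) ;".split(), {"a1"}, "a1")
fallback = predict(db, "free ( p -> data ) ;".split(), {"p"}, "p")
unknown = predict(db, "foo ( q )".split(), {"q"}, "q")
print(exact, "|", fallback, "|", unknown)
```

In this sketch the second query has no matching 4-token context, so the prediction falls back to the 2-token context `free ( <VAR> -> data`, which is still specific enough to select `dict_t *`. A query with no matching context at any size returns `None` rather than guessing.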

Paper Structure

This paper contains 25 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Hex-Rays decompilation (with names and types)
  • Figure 2: Hex-Rays decompilation (without names and types)
  • Figure 3: Decompilation provides contextual hints even with wrong names and types.
  • Figure 4: Running example ground truth types. When a dict_t * is mistyped as an int *, the field accesses look different.
  • Figure 5: All of the 4-grams for variables found in Figure 2 and two examples of N-gram normalization and hashing.
  • ...and 2 more figures