Call graph discovery in binary programs from unknown instruction set architectures

Håvard Pettersen, Donn Morrison

TL;DR

This work tackles reverse engineering of binaries with unknown instruction set architectures (ISAs) by introducing a heuristic pipeline that detects candidate call and return opcodes and constructs call graphs from them. It defines the Opcode Candidacy Probability Score (OCP-Score), which ranks opcodes using static cues such as absolute/relative addressing and nearby epilogue patterns, so that call graphs can be extracted automatically from the highest-ranked candidates. The method is evaluated on a small multi-ISA dataset covering the Chip8, MIPS, and AArch64 architectures and programs such as OpenVPN, cURL, and Chipquarium. Results show promising opcode detection and plausible graphs for fixed-length ISAs, while highlighting limitations for variable-length architectures and noisy data. The work provides a practical, low-dependency tool to assist reverse engineers and security analysts, with future directions including broader ISA support, NOP/disambiguation, and larger-scale validation.

Abstract

This study addresses the challenge of reverse engineering binaries from unknown instruction set architectures, a complex task with potential implications for software maintenance and cyber-security. We focus on the tasks of detecting candidate call and return opcodes for automatic extraction of call graphs in order to simplify the reverse engineering process. Empirical testing on a small dataset of binary files from different architectures demonstrates that the approach can accurately detect specific opcodes under conditions of noisy data. The method lays the groundwork for a valuable tool for reverse engineering where the reverse engineer has minimal a priori knowledge of the underlying instruction set architecture.

Paper Structure (14 sections, 1 equation, 12 figures, 10 tables, 2 algorithms)

Figures (12)

  • Figure 1: ELF file structure [wikielf].
  • Figure 2: Call graph constructed from a program containing a main function which calls Function 1 and Function 2.
  • Figure 3: Context of the use of the proposed solution, occurring between architectural feature extraction and sub-component scanning.
  • Figure 4: User interface of the frontend solution, showing the different pages for uploading a binary file, entering parameters, and displaying the generated call graph.
  • Figure 5: OCP-Score for different values of the instructionLength parameter, shown for the cURL and OpenVPN binaries on the MIPS and AArch64 architectures.
  • ...and 7 more figures