reAnalyst: Scalable Annotation of Reverse Engineering Activities

Tab Zhang, Claire Taylor, Bart Coppens, Waleed Mebane, Christian Collberg, Bjorn De Sutter

TL;DR

This work addresses the challenge of scalable, objective annotation of reverse engineering activities by extending a tool-agnostic data-collection framework (RevEngE) with reAnalyst, which automatically extracts annotations such as function, basic block, and navigation events from live data streams. The approach combines OCR-based analysis of screenshots, symbol mappings from RE tools, and keystroke/mouse data to produce time-stamped annotations, supplemented by manual and session-level context. Evaluation across student datasets and a public RE challenge demonstrates high reliability (e.g., 97.7% function-annotation accuracy, 95.9% basic-block accuracy) and broad acceptability among participants, with ethical safeguards and privacy controls. The framework is open-source and designed to enable deeper, more realistic studies of RE practices and the effectiveness of protections, while reducing the labor burden of annotation for large-scale experiments.
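To make the combination of OCR output and symbol mappings concrete, the following is a minimal sketch, not the authors' implementation. It assumes that OCR text for a screenshot and a list of function names exported from an RE tool are already available; the names `Annotation`, `annotate_functions`, and the `cutoff` parameter are hypothetical and introduced only for illustration. Fuzzy matching from Python's standard `difflib` stands in for whatever matching strategy the framework actually uses.

```python
from dataclasses import dataclass
from difflib import get_close_matches
import re


@dataclass(frozen=True)
class Annotation:
    timestamp: float  # seconds since session start (assumed unit)
    function: str     # symbol name matched in the screenshot text


def annotate_functions(timestamp, ocr_text, symbols, cutoff=0.85):
    """Return one annotation per known symbol that appears in the OCR text.

    `symbols` is a hypothetical list of function names taken from the
    RE tool's symbol mapping; `cutoff` is an assumed similarity threshold.
    """
    # Split the OCR text into identifier-like tokens.
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", ocr_text))
    found = set()
    for token in tokens:
        # Fuzzy matching tolerates small OCR errors such as 'l' read as '1'.
        match = get_close_matches(token, symbols, n=1, cutoff=cutoff)
        if match:
            found.add(match[0])
    return [Annotation(timestamp, name) for name in sorted(found)]


if __name__ == "__main__":
    symbols = ["main", "check_license", "bridge"]
    text = "call check_1icense ; jmp main"
    for annotation in annotate_functions(12.5, text, symbols):
        print(annotation)
```

Running the sketch over every screenshot in a session would yield a time-ordered list of function annotations that could then be merged with keystroke and mouse events.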

Abstract

This paper introduces reAnalyst, a framework designed to facilitate the study of reverse engineering (RE) practices through the semi-automated annotation of RE activities across various RE tools. By integrating tool-agnostic data collection of screenshots, keystrokes, active processes, and other types of data during RE experiments with semi-automated data analysis and generation of annotations, reAnalyst aims to overcome the limitations of traditional RE studies that rely heavily on manual data collection and subjective analysis. The framework enables more efficient data analysis, which will in turn allow researchers to explore the effectiveness of protection techniques and strategies used by reverse engineers more comprehensively and efficiently. Experimental evaluations validate the framework's capability to identify RE activities from a diverse range of screenshots with varied complexities. Observations on past experiments with our framework as well as a survey among reverse engineers provide further evidence of the acceptability and practicality of our approach.

Paper Structure (47 sections, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Simulated timeline view, showing different types of annotations, such as those indicating which form of analysis was being deployed, and which functions (main, target, bridge) were being displayed in a tool at which times.
  • Figure 2: Sample of the taxonomy by Ceccato et al. [Ceccato2017, Ceccato2019].
  • Figure 3: Simulated animation view, which replays the collected screenshots along with recorded keystroke input ("license"), the current task annotation generated after function annotation ("Target Function"), and the top active processes with their CPU usage percentages. Researchers can use the buttons on the right to play, pause, or manually add task annotations to the timeline.
  • Figure 4: Sample keystroke data snippet showing individual keystrokes combined into keyboard inputs.
  • Figure 5: An example screenshot of IDA Pro showing seven basic blocks, marked A-G, of a function’s CFG. Some of the blocks are only partially shown. Still, the framework correctly identified all of them except basic block C, of which only a minimal portion is visible.
  • ...and 7 more figures