reAnalyst: Scalable Annotation of Reverse Engineering Activities
Tab Zhang, Claire Taylor, Bart Coppens, Waleed Mebane, Christian Collberg, Bjorn De Sutter
TL;DR
This work addresses the challenge of scalable, objective annotation of reverse engineering (RE) activities by extending a tool-agnostic data-collection framework (RevEngE) with reAnalyst, which automatically extracts annotations such as function, basic-block, and navigation events from live data streams. The approach combines OCR-based analysis of screenshots, symbol mappings from RE tools, and keystroke and mouse data to produce time-stamped annotations, which are supplemented by manual annotations and session-level context. Evaluation on student datasets and a public RE challenge demonstrates high reliability (e.g., 97.7% function-annotation accuracy and 95.9% basic-block accuracy) and broad acceptability among participants, supported by ethical safeguards and privacy controls. The framework is open-source and is designed to enable deeper, more realistic studies of RE practices and of the effectiveness of protections, while reducing the manual annotation effort required for large-scale experiments.
Abstract
This paper introduces reAnalyst, a framework designed to facilitate the study of reverse engineering (RE) practices through the semi-automated annotation of RE activities across various RE tools. By integrating tool-agnostic collection of screenshots, keystrokes, active processes, and other types of data during RE experiments with semi-automated data analysis and generation of annotations, reAnalyst aims to overcome the limitations of traditional RE studies that rely heavily on manual data collection and subjective analysis. The framework enables more efficient data analysis, which will in turn allow researchers to explore both the effectiveness of protection techniques and the strategies used by reverse engineers more comprehensively. Experimental evaluations validate the framework's capability to identify RE activities from a diverse range of screenshots of varying complexity. Observations on past experiments with our framework, as well as a survey among reverse engineers, provide further evidence of the acceptability and practicality of our approach.
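To make the idea of deriving time-stamped function annotations from screenshot text and a symbol mapping more concrete, the following is a minimal sketch and not reAnalyst's actual implementation. It assumes the OCR step has already produced plain text per screenshot and that the RE tool has exported a list of function names. The `Annotation` type, the `annotate_functions` helper, the input format, and the similarity cutoff are all hypothetical choices made for illustration. Fuzzy matching with the standard library's `difflib` stands in for whatever matching the framework really uses.

```python
import difflib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    timestamp: float  # seconds since session start (assumed unit)
    kind: str         # annotation type, here always "function"
    symbol: str       # symbol name the OCR token was matched to


def annotate_functions(frames, symbols, cutoff=0.85):
    """Match OCR tokens from each frame against known function symbols.

    frames:  iterable of (timestamp, ocr_text) pairs (hypothetical format)
    symbols: list of function names from the RE tool's symbol mapping
    cutoff:  minimum difflib similarity, to tolerate small OCR errors
    """
    annotations = []
    for timestamp, text in frames:
        seen = set()  # report each symbol at most once per frame
        # Split the OCR output into identifier-like tokens.
        for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
            match = difflib.get_close_matches(token, symbols, n=1, cutoff=cutoff)
            if match and match[0] not in seen:
                seen.add(match[0])
                annotations.append(Annotation(timestamp, "function", match[0]))
    return annotations


if __name__ == "__main__":
    # Made-up example data: the second frame contains an OCR error ("0" for "o").
    frames = [
        (1.0, "call check_license"),
        (2.5, "sub decrypt_payl0ad ; xref"),
    ]
    symbols = ["check_license", "decrypt_payload"]
    for annotation in annotate_functions(frames, symbols):
        print(annotation)
```

The similarity cutoff trades robustness to OCR noise against false matches between similarly named symbols. A real pipeline would likely also use screen layout and input events to disambiguate.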
