Automatic State Machine Inference for Binary Protocol Reverse Engineering
Junhai Yang, Fenghua Li, Yixuan Zhang, Junhao Zhang, Liang Fang, Yunchuan Guo
TL;DR
The paper addresses protocol reverse engineering in mixed, unknown traffic, with a focus on Protocol State Machine (PSM) inference. It introduces an automatic pipeline that first clusters protocol formats with a fuzzy membership-based auto-converging DBSCAN algorithm (ACDA), then clusters sessions using Needleman-Wunsch alignment and K-Medoids, and finally refines probabilistic PSMs with total-probability-based noise suppression. Key contributions are the ACDA-based protocol format clustering, the NW-KMedoids session clustering for mixed protocols, and a more robust probabilistic PSM inference procedure that extends Veritas. Experiments on TLSv1.2 and SMTP traces show high state and transition matching (SMC≈1.0, TMC≈0.86–1.0) and strong clustering performance, indicating that the approach infers PSMs effectively for both binary and text protocols in mixed environments.
Abstract
Protocol Reverse Engineering (PRE) is used to analyze protocols by inferring their structure and behavior. However, current PRE methods mainly focus on field identification within a single protocol and neglect Protocol State Machine (PSM) analysis in mixed protocol environments. This results in insufficient analysis of protocols' abnormal behavior and potential vulnerabilities, which are crucial for detecting and defending against new attack patterns. To address these challenges, we propose an automatic PSM inference framework for unknown protocols. The framework first applies a fuzzy membership-based auto-converging DBSCAN algorithm to cluster protocol formats, then uses a session clustering algorithm based on the Needleman-Wunsch and K-Medoids algorithms to classify sessions by protocol type. Finally, we refine a probabilistic PSM algorithm to infer protocol states and the transition conditions between these states. Experimental results show that, compared with existing PRE techniques, our method can infer PSMs while enabling more precise classification of protocols.
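The session clustering step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each session has already been reduced to a sequence of message-format labels (here the hypothetical labels "A", "B", "X", ...), that alignment uses unit match, mismatch, and gap scores, and that the alignment score is normalized into a distance in [0, 1]. The K-Medoids variant shown is the simple alternating scheme with a farthest-first initialization, chosen only to keep the example deterministic.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two label sequences."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def nw_distance(a, b):
    """Assumed normalization: with unit scores the alignment score lies in
    [-L, L], where L is the longer length, so this maps it to [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return (longest - nw_score(a, b)) / (2 * longest)


def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternating K-Medoids on a precomputed distance matrix."""
    n = len(dist)
    # Farthest-first initialization (an illustrative choice).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members,
                                   key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids), medoids


# Hypothetical sessions from two protocols, as sequences of format labels.
sessions = [
    ["A", "B", "C", "D"],
    ["A", "B", "D"],
    ["A", "B", "C", "C", "D"],
    ["X", "Y", "Z"],
    ["X", "Y", "Y", "Z"],
    ["X", "Z"],
]
dist = [[nw_distance(s, t) for t in sessions] for s in sessions]
labels, medoids = k_medoids(dist, k=2)
print(labels, medoids)
```

In this toy input the two groups share no labels, so every cross-group distance is 1.0 and the two clusters separate cleanly. Real traces would need a scoring scheme and a choice of k suited to the data.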
