ByCAN: Reverse Engineering Controller Area Network (CAN) Messages from Bit to Byte Level

Xiaojie Lin; Baihe Ma; Xu Wang; Guangsheng Yu; Ying He; Ren Ping Liu; Wei Ni


TL;DR

This work tackles the opacity of CAN message semantics by presenting ByCAN, a fully automated reverse-engineering system that decodes CAN frame payloads using both byte-level clustering and bit-level slicing, without relying on prior DBC knowledge. The method introduces new byte- and bit-level features, uses DBSCAN for unsupervised clustering, and uses Dynamic Time Warping (DTW) to align Dynamic CAN signals with OBD-II templates for descriptive labeling. Experimental results on real-world vehicle traces show that ByCAN achieves higher slicing accuracy (80.21%), slicing coverage (95.21%), and labeling accuracy (68.72%) than two leading reverse-engineering systems, and that this advantage holds across signal types and frame counts. The approach reduces the manual effort of understanding in-vehicle networks and is relevant to automotive cybersecurity research and applied diagnostics, since it extracts labels from largely opaque CAN data.
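To make the template-matching step concrete, the following is a minimal sketch of a DTW distance and a nearest-template lookup. It is not the paper's implementation: the absolute-difference cost, the absence of any normalization or windowing, and the names `dtw_distance`, `best_template`, `engine_speed`, and `vehicle_speed` are all assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping.

    Assumption: the local cost is the absolute difference of samples.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_template(signal, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(signal, templates[name]))


# Hypothetical data: a candidate Dynamic signal and two OBD-II-like templates
# sampled at a different rate than the CAN signal.
signal = [0, 1, 2, 3, 2, 1]
templates = {
    "engine_speed": [0, 1, 1, 2, 3, 3, 2, 1],
    "vehicle_speed": [3, 3, 3, 3, 3, 3],
}
label = best_template(signal, templates)
```

Because DTW allows one sample to match several samples of the other series, the candidate signal and the template do not need the same length or sampling rate, which is why it suits comparing CAN signals against OBD-II responses collected at a different cadence.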

Abstract

As the primary standard protocol for modern cars, the Controller Area Network (CAN) is a critical research target for automotive cybersecurity threats and autonomous applications. Because the decoding specification of CAN is a proprietary black box maintained by Original Equipment Manufacturers (OEMs), related research and industry development are challenging without a comprehensive understanding of the meaning of CAN messages. In this paper, we propose a fully automated reverse-engineering system, named ByCAN, to reverse engineer CAN messages. ByCAN outperforms existing research by introducing byte-level clusters and integrating multiple features at both the byte and bit levels. ByCAN employs clustering and template-matching algorithms to automatically decode the specifications of CAN frames without the need for prior knowledge. Experimental results demonstrate that ByCAN achieves high accuracy in slicing and labeling, i.e., in identifying CAN signal boundaries and labels. In the experiments, ByCAN achieves a slicing accuracy of 80.21%, a slicing coverage of 95.21%, and a labeling accuracy of 68.72% for general labels when analyzing real-world CAN frames.


Paper Structure (36 sections, 16 equations, 5 figures, 6 tables, 2 algorithms)


Introduction
Background
Controller Area Network
OBD-II Diagnostic Data
Collecting CAN Frames and OBD-II Diagnostic Data
CAN Database Container
DBSCAN Clustering Algorithm
Related Work
Proposed System
Features Identification
Flip Rate
Average Value
Distinct Value Ratio
Data Pre-processing
CAN Frame Grouping
...and 21 more sections

Figures (5)

Figure 1: Standard CAN frame format: 1-bit Start of Frame (SOF), 11-bit ID, 1-bit Remote Transmission Request (RTR), 1-bit Identifier Extension Bit (IDE), 1-bit Reserved bit (R0), 4-bit Data Length Code (DLC), 0- to 8-byte data payload, 15-bit Cyclic Redundancy Check (CRC), 1-bit CRC delimiter, 1-bit Acknowledge (ACK), 1-bit ACK delimiter, and 7-bit End of Frame (EOF).
Figure 2: Sample DBC: Mazda 3, Year 2019, CAN ID 0x01A, from openDBC. CAN signals may occupy more than one byte, e.g., engine speed and pedal gas, shown as the yellow Dynamic CAN signals that span more than 8 consecutive bits. CAN signals often align with byte boundaries, such as the Dynamic signals aligned to the left and the Verification signals aligned to the right.
Figure 3: ByCAN system: In the data pre-processing procedure, CAN messages are grouped by CAN ID first. Then, the grouped CAN messages are reformatted into the trace $M_C$ and $m_C$ with CAN frames' data payloads segmented at byte level and bit level, respectively. In the signal-slicing procedure, the byte-level CAN signal features are extracted to deduce byte-level signal clusters. The bit-level CAN signal boundaries are further sliced within each byte-level cluster using proposed signal features at the bit level. In the signal labeling procedure, the sliced CAN signals are first labeled as general categories (i.e., Unused, Switch, Dynamic and Verification). Finally, the descriptive labels are identified by applying the template matching algorithm to measure the similarity between the Dynamic signals and OBD-II diagnostic messages.
Figure 4: Comparison of slicing accuracy and slicing coverage of different systems: The $y$-axis is the slicing accuracy $\zeta$ and the $x$-axis is CAN signal type in (a); the $y$-axis is the slicing coverage $\varpi$ and the $x$-axis is CAN signal type in (b); the $y$-axis is the slicing accuracy $\zeta$ and the $x$-axis is the number of CAN frames in (c); the $y$-axis is the slicing coverage $\varpi$ and the $x$-axis is the number of CAN frames in (d). Note that the Verification CAN signals represent both the Counter and Checksum signals.
Figure 5: Comparison of labeling accuracy of different systems: The $y$-axis is the labeling accuracy $\xi$ for all subplots. The $x$-axis is the CAN signal type in (a), and the $x$-axis is the number of CAN frames in (b) and (c). Subplot (b) gives the labeling accuracy including Unused CAN signals while subplot (c) gives the labeling accuracy excluding Unused CAN signals.
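The pre-processing stage described in the Figure 3 caption (grouping by CAN ID, then viewing each payload at byte level and bit level) can be sketched as follows. This is an illustrative sketch rather than the paper's code: the `(can_id, payload)` input format, the MSB-first bit order, the helper names, and the flip-rate definition used here (the fraction of consecutive frame pairs in which a bit changes) are assumptions, and payloads under one ID are assumed to have equal length.

```python
from collections import defaultdict


def group_by_id(frames):
    """Group (can_id, payload) pairs by CAN ID, preserving arrival order."""
    groups = defaultdict(list)
    for can_id, payload in frames:
        groups[can_id].append(bytes(payload))  # byte-level view of the trace
    return dict(groups)


def to_bits(payload):
    """Expand a payload into bits, most significant bit of each byte first."""
    return [(byte >> (7 - k)) & 1 for byte in payload for k in range(8)]


def bit_flip_rate(payloads):
    """Per-bit fraction of consecutive frame pairs in which the bit changes.

    Assumed definition; the paper's exact feature formula is not given here.
    """
    rows = [to_bits(p) for p in payloads]  # bit-level view of the trace
    if len(rows) < 2:
        return [0.0] * (len(rows[0]) if rows else 0)
    flips = [0] * len(rows[0])
    for prev, cur in zip(rows, rows[1:]):
        for k, (x, y) in enumerate(zip(prev, cur)):
            flips[k] += x != y
    return [f / (len(rows) - 1) for f in flips]


# Hypothetical trace: ID 0x1A carries a counter in its last byte.
frames = [
    (0x1A, b"\x00\x01"),
    (0x2B, b"\xff"),
    (0x1A, b"\x00\x02"),
    (0x1A, b"\x00\x03"),
]
groups = group_by_id(frames)
rates = bit_flip_rate(groups[0x1A])
```

In this toy trace the least significant bit of the counter flips on every frame while the upper byte never changes, which is the kind of contrast that per-bit features are meant to expose before any clustering is applied.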
