A Comprehensive Guide to CAN IDS Data & Introduction of the ROAD Dataset

Miki E. Verma, Robert A. Bridges, Michael D. Iannacone, Samuel C. Hollifield, Pablo Moriano, Steven C. Hespeler, Bill Kay, Frank L. Combs

TL;DR

The paper addresses a critical bottleneck in CAN intrusion detection research: the lack of high-fidelity, open benchmarking data. It provides a comprehensive survey of existing CAN IDS datasets, quality assessments, and a principled framework for dataset selection. The Real ORNL Automotive Dynamometer (ROAD) dataset is introduced as a richly annotated, real-vehicle CAN data source with a spectrum of attacks, including stealthy fabrication, masquerade simulations, and advanced attacks, plus signal-translated inputs to support diverse detector designs. ROAD, together with the dataset guide, aims to standardize benchmarking, improve reproducibility, and accelerate development of robust CAN IDS methods. While ROAD makes substantial strides, the authors acknowledge remaining gaps, notably real masquerade data and richer physical-layer inputs, outlining a clear path for future data collection and standardization efforts.

Abstract

Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions on CANs. Producing vehicular CAN data with a variety of intrusions is out of reach for most researchers as it requires expensive assets and expertise. To assist researchers, we present the first comprehensive guide to the existing open CAN intrusion datasets, including a quality analysis of each dataset and an enumeration of each dataset's benefits, drawbacks, and suggested use case. Current public CAN IDS datasets are limited to real fabrication (simple message injection) attacks and simulated attacks often in synthetic data, which lack fidelity. In general, the physical effects of attacks on the vehicle are not verified in the available datasets. Only one dataset provides signal-translated data but not a corresponding raw binary version. Overall, the available data pigeon-holes CAN IDS works into testing on limited, often inappropriate data (usually with attacks that are too easily detectable to truly test the method), and this lack of data has stymied comparability and reproducibility of results. As our primary contribution, we present the ROAD (Real ORNL Automotive Dynamometer) CAN Intrusion Dataset, consisting of over 3.5 hours of one vehicle's CAN data. ROAD contains ambient data recorded during a diverse set of activities, and attacks of increasing stealth with multiple variants and instances of real fuzzing, fabrication, and unique advanced attacks, as well as simulated masquerade attacks. To facilitate benchmarking CAN IDS methods that require signal-translated inputs, we also provide the signal time series format for many of the CAN captures. Our contributions aim to facilitate appropriate benchmarking and needed comparability in the CAN IDS field.

Paper Structure

This paper contains 27 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Papers published in peer-reviewed journals: a) yearly trend of CAN IDS research and b) frequency of each CAN IDS category; 1) frequency/timing-based, 2) payload-based, 3) signal-based, 4) physical side-channel, and 5) other.
  • Figure 2: CAN data frame: The two primary components are the Arbitration ID used for message identification and arbitration (prioritizing messages) and the Data Field, containing up to 8 bytes of message contents.
  • Figure 3: The time gaps between subsequent messages (all messages on the CAN capture of any ID are included) are plotted over time during fuzzing attacks on four different vehicles, with the top three plots from each vehicle in the HCRL Survival Analysis Dataset, and the bottom plot from the ORNL dataset. While the injections (in red) result in a significant disruption in the overall message timings in the HCRL dataset, the fuzzing attack in the ORNL dataset does not, and would therefore be slightly more difficult to detect using a timing-based IDS. This also illustrates that the bus load and overall message frequency distribution vary widely across vehicles.
  • Figure 4: HCRL Car Hacking Dataset contains unintentional artifacts of data collection; in particular, in each of the four attack datasets, right after the conclusion of the attack, there is a prolonged period during which no messages appear on the bus. This depicts the end of the DoS dataset, starting from the last four injection intervals (red), followed by $\mathtt{\sim}$53s of ambient traffic (blue), and a $\mathtt{\sim}$22s transmission gap before ambient messages resume again. Note that the first point after messages resume (with a $\Delta t \approx 22.4s$) has been omitted for scale. We hypothesize that this gap is due to the CAN bus going into a "stand-by" mode due to inactivity, that is, the vehicle is not being operated and no messages are being injected.
  • Figure 5: Snippet of metadata for two example captures, with an example of ambient (left) and attack (right) entries.
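
The quantities described in the captions for Figures 2 through 4 can be illustrated with a short Python sketch. It models a CAN data frame as an arbitration ID plus a data field of at most 8 bytes, computes the time gap between subsequent messages across all IDs, and flags prolonged silences such as the transmission gap noted for the HCRL Car Hacking Dataset. The names `CanFrame`, `time_gaps`, and `find_silences`, the in-memory frame list, and the threshold value are all assumptions made for illustration; they are not the authors' tooling and do not reflect the file format of any of the datasets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CanFrame:
    """Hypothetical minimal model of a CAN data frame (see Figure 2)."""
    timestamp: float      # seconds since the start of the capture
    arbitration_id: int   # identifies the message and sets its priority
    data: bytes           # data field, up to 8 bytes of message contents

    def __post_init__(self) -> None:
        if len(self.data) > 8:
            raise ValueError("data field holds at most 8 bytes")


def _ordered(frames: List[CanFrame]) -> List[CanFrame]:
    # Sort by timestamp so gaps are computed between consecutive messages.
    return sorted(frames, key=lambda f: f.timestamp)


def time_gaps(frames: List[CanFrame]) -> List[float]:
    """Gap between each message and the previous one, regardless of ID."""
    ordered = _ordered(frames)
    return [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]


def find_silences(frames: List[CanFrame],
                  threshold_s: float) -> List[Tuple[float, float]]:
    """Return (start_time, gap_length) for every gap above threshold_s."""
    ordered = _ordered(frames)
    return [
        (a.timestamp, b.timestamp - a.timestamp)
        for a, b in zip(ordered, ordered[1:])
        if b.timestamp - a.timestamp > threshold_s
    ]


# Made-up example frames; the IDs and payloads are illustrative only.
frames = [
    CanFrame(0.0, 0x0C1, bytes([0x00, 0x10])),
    CanFrame(0.5, 0x1A0, bytes(8)),
    CanFrame(1.0, 0x0C1, bytes([0x00, 0x11])),
    CanFrame(23.5, 0x1A0, bytes(8)),
]
gaps = time_gaps(frames)
silences = find_silences(frames, threshold_s=1.0)
```

A check of this kind, run before benchmarking, is one way to notice collection artifacts like a long silence after an attack interval; a suitable threshold depends on the vehicle's normal bus load, which the Figure 3 caption notes varies widely across vehicles.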