MMDVS-LF: Multi-Modal Dynamic Vision Sensor and Eye-Tracking Dataset for Line Following

Felix Resch, Mónika Farsang, Radu Grosu

TL;DR

MMDVS-LF introduces a multimodal, compact dataset for line following that combines Dynamic Vision Sensor (DVS) events with eye-tracking, RGB video, odometry, IMU, and driver demographics. It emphasizes synchronized multi-modal data and representations such as time surfaces and event frames to enable event-based deep learning for control tasks, validated through steering-prediction benchmarks and attention-map analyses. The dataset supports various resolutions and frequencies, provides detailed recording/annotation formats, and demonstrates potential for broader tasks (e.g., control, driver identification) beyond simple steering. This resource aims to promote trustworthy, interpretable, and efficient development of DVS-based models and end-to-end learning pipelines on accessible hardware like roboracer platforms.

Abstract

Dynamic Vision Sensors (DVS) offer a unique advantage in control applications due to their high temporal resolution and asynchronous event-based data. Still, their adoption in machine learning algorithms remains limited. To address this gap and promote the development of models that leverage the specific characteristics of DVS data, we introduce MMDVS-LF: Multi-Modal Dynamic Vision Sensor and Eye-Tracking Dataset for Line Following. This comprehensive dataset is the first to integrate multiple sensor modalities, including DVS recordings and eye-tracking data from a small-scale standardized vehicle. Additionally, the dataset includes RGB video, odometry, Inertial Measurement Unit (IMU) data, and demographic data of drivers performing a line-following task. With its diverse range of data, MMDVS-LF opens new opportunities for developing event-based deep learning algorithms, much as the MNIST dataset did for Convolutional Neural Networks.

Paper Structure (19 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Setup used to record the dataset. The human driver views the RGB stream while wearing an eye-tracking headset and controlling the vehicle remotely.
  • Figure 2: An RGB frame and the corresponding DVS data in different representations. In the time surface and the event tensor, darker colors indicate earlier events and lighter colors indicate later ones.
  • Figure 3: Temporal synchronization points between the three main temporal frames and annotated eye-tracking stream with annotated ArUco markers and RGB stream. The blue dot in the eye-tracking frame represents the gaze of the participant.
  • Figure 4: Distribution of driving inputs, such as steering angle and acceleration command from the human drivers and speed measured by odometry.
  • Figure 5: For benchmarking, we use the following architecture: sequences of time surface data are created and fed into our neural networks. These networks consist of a CNN Block with several convolutional and max pooling layers, followed by a flattening layer. These features are fed into a fully-connected RNN, which predicts the sequence of steering commands corresponding to the input. Before and after the RNN block, we use additional input and output mappings. For analysis, we apply the VisualBackProp method to extract the attention maps of the trained models. These are compared to the human attention from the eye-tracking data available in our dataset.
  • ...and 2 more figures
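
The event frame and time surface representations named in the TL;DR and in Figure 2 can be illustrated with a minimal sketch. It is not the dataset's own tooling: the event layout (x, y, timestamp, polarity), the function names, and the exponential decay with time constant tau are assumptions chosen for illustration. The decay is one common way to build a time surface in which later events appear brighter, which matches the color convention described for Figure 2.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical event layout (an assumption, not the dataset's format):
# (x, y, timestamp in seconds, polarity in {-1, +1}).
Event = Tuple[int, int, float, int]


def event_frame(events: List[Event], width: int, height: int) -> List[List[int]]:
    """Sum the polarities of all events at each pixel over the window."""
    frame = [[0] * width for _ in range(height)]
    for x, y, _t, polarity in events:
        frame[y][x] += polarity
    return frame


def time_surface(events: List[Event], width: int, height: int,
                 t_ref: float, tau: float) -> List[List[float]]:
    """Exponentially decayed age of the latest event at each pixel.

    A pixel whose latest event happened at t_ref gets 1.0, older events
    decay towards 0.0, and pixels without any event stay at 0.0.
    """
    latest: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for x, y, t, _polarity in events:
        # Keep only the most recent timestamp per pixel, ignoring events after t_ref.
        if t <= t_ref and (latest[y][x] is None or t > latest[y][x]):
            latest[y][x] = t
    return [
        [0.0 if t is None else math.exp(-(t_ref - t) / tau) for t in row]
        for row in latest
    ]


if __name__ == "__main__":
    demo = [(0, 0, 0.00, 1), (1, 0, 0.01, -1), (0, 0, 0.02, 1)]
    print(event_frame(demo, width=2, height=1))
    print(time_surface(demo, width=2, height=1, t_ref=0.02, tau=0.01))
```

The event frame discards timing within the window, while the time surface keeps the recency of the last event per pixel, so the two representations trade temporal detail against simplicity in different ways.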