eNavi: Event-based Imitation Policies for Low-Light Indoor Mobile Robot Navigation

Prithvi Jai Ramesh, Kaustav Chanda, Krishna Vinod, Joseph Raj Vishal, Yezhou Yang, Bharatesh Chakravarthi

Abstract

Event cameras provide high dynamic range and microsecond-level temporal resolution, making them well-suited for indoor robot navigation, where conventional RGB cameras degrade under fast motion or low-light conditions. Despite advances in event-based perception spanning detection, SLAM, and pose estimation, there remains limited research on end-to-end control policies that exploit the asynchronous nature of event streams. To address this gap, we introduce a real-world indoor person-following dataset collected using a TurtleBot 2 robot, featuring synchronized raw event streams, RGB frames, and expert control actions across multiple indoor maps and trajectories under both normal and low-light conditions. We further build a multimodal data preprocessing pipeline that temporally aligns event and RGB observations while reconstructing ground-truth actions from odometry to support high-quality imitation learning. Building on this dataset, we propose a late-fusion RGB-Event navigation policy that combines dual MobileNet encoders with a transformer-based fusion module trained via behavioral cloning. A systematic evaluation of RGB-only, Event-only, and RGB-Event fusion models across 12 training variations, ranging from single-path imitation to general multi-path imitation, shows that policies incorporating event data, particularly the fusion model, achieve improved robustness and lower action prediction error, especially in unseen low-light conditions where RGB-only models fail. We release the dataset, synchronization pipeline, and trained models at https://eventbasedvision.github.io/eNavi/
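
The preprocessing step described in the abstract (temporal alignment of observations plus action reconstruction from odometry) can be illustrated with a minimal sketch. It is not the released pipeline: the pose layout `(t, x, y, yaw)`, the finite-difference estimate of $(v, \omega)$, nearest-timestamp matching, and all function names below are assumptions made for illustration.

```python
import bisect
import math


def nearest_index(sorted_times, t):
    """Return the index of the timestamp in sorted_times closest to t."""
    i = bisect.bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    # Pick whichever neighbour is closer in time.
    return i if sorted_times[i] - t < t - sorted_times[i - 1] else i - 1


def wrap_angle(a):
    """Wrap an angle to [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def actions_from_odometry(odom):
    """Finite-difference (v, omega) from consecutive (t, x, y, yaw) poses.

    Returns one (t, v, omega) tuple per pose interval, stamped at the
    start of the interval. Assumed pose layout, not the paper's format.
    """
    actions = []
    for (t0, x0, y0, th0), (t1, x1, y1, th1) in zip(odom, odom[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order stamps
        dx, dy = x1 - x0, y1 - y0
        # Project displacement onto the heading so reversing gives v < 0.
        v = (dx * math.cos(th0) + dy * math.sin(th0)) / dt
        omega = wrap_angle(th1 - th0) / dt
        actions.append((t0, v, omega))
    return actions


def align(rgb_times, event_times, actions):
    """Pair each RGB frame with its nearest event window and action."""
    action_times = [a[0] for a in actions]
    samples = []
    for k, t in enumerate(rgb_times):
        e = nearest_index(event_times, t)
        a = nearest_index(action_times, t)
        samples.append((k, e, actions[a][1], actions[a][2]))
    return samples
```

Each returned sample holds an RGB frame index, an event-window index, and the matched $(v, \omega)$ pair. A real pipeline would likely also reject matches whose time gap exceeds a tolerance; that check is omitted here for brevity.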

Paper Structure (27 sections, 5 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Overview of the dataset generation workflow. (a) The process begins with teleoperated data collection using a mobile robot setup. (b) During collection, the operator teleoperates the robot to follow another person across the room along three different paths ($P_1$, $P_2$, and $P_3$) under different lighting conditions.
  • Figure 2: Overview of the dataset generation workflow. The process begins with teleoperated data collection using a mobile robot setup. The recorded ROS 2 bags are analyzed for trajectory and velocity consistency (center panels) before being synchronized into paired image, event, and action tuples $\{(\mathbf{I}_k, \mathbf{E}_k, \mathbf{a}_k)\}$ stored in .h5 format.
  • Figure 3: The Event-based Navigation Policy (ENP-Fusion) Architecture. The model fuses a frozen RGB stream and a trainable event stream using a Transformer-based attention mechanism. The system takes synchronized RGB frames ($320 \times 180 \times 3$) and event frames ($320 \times 180 \times 2$) as input and outputs continuous differential-drive commands ($v$, $\omega$) via an MLP head. The snowflake icon denotes that the RGB backbone weights remain fixed during training to prevent overfitting in low-light scenarios.
  • Figure 4: Trajectory Analysis of ENP-Fusion Policy. Comparison of predicted linear velocity (m/s) and angular rate (rad/s) against expert ground truth for a representative test trajectory ($P_3$). (a) and (b) contrast performance under normal-light (top) and low-light (bottom) conditions.
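
The Figure 3 caption gives the event input as a $320 \times 180 \times 2$ frame but does not say here how raw events are binned. One common choice, assumed for this sketch rather than taken from the paper, is to count positive- and negative-polarity events per pixel in separate channels. The function name `events_to_frame` and the `(x, y, polarity)` tuple layout are hypothetical.

```python
def events_to_frame(events, width=320, height=180):
    """Accumulate (x, y, polarity) events into a height x width x 2 grid.

    Channel 0 counts positive-polarity events and channel 1 counts
    negative-polarity events (an assumed encoding, not the paper's).
    """
    frame = [[[0, 0] for _ in range(width)] for _ in range(height)]
    for x, y, p in events:
        # Ignore events that fall outside the sensor bounds.
        if 0 <= x < width and 0 <= y < height:
            frame[y][x][0 if p > 0 else 1] += 1
    return frame


# Example: two positive events at the origin, one negative event in the
# far corner, and one out-of-bounds event that is dropped.
frame = events_to_frame([(0, 0, 1), (0, 0, 1), (319, 179, 0), (400, 10, 1)])
```

In practice the events passed in would be those falling inside the time window matched to a given RGB frame, so that each $(\mathbf{I}_k, \mathbf{E}_k)$ pair covers the same interval.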