LiDAR-Event Stereo Fusion with Hallucinations

Luca Bartolomei, Matteo Poggi, Andrea Conti, Stefano Mattoccia

TL;DR

The paper tackles depth estimation with stereo event cameras, whose event streams become too sparse for reliable matching in textureless or motionless regions. It fuses sparse LiDAR depth measurements with events through two hallucination-based mechanisms: Virtual Stack Hallucination (VSH), which injects depth hints into the stacked event representations, and Back-in-Time Hallucination (BTH), which injects them directly into the raw event history and so preserves the microsecond resolution of events. On the DSEC and M3ED datasets, VSH and BTH consistently outperform RGB-LiDAR fusion baselines, with BTH often achieving the best 1-pixel error (1PE) and mean absolute error (MAE), and both remain robust when the LiDAR data is not synchronized with the target timestamp. The work advances practical, high-precision depth estimation for fast-motion scenarios by exploiting sparse depth cues without sacrificing temporal fidelity, and it provides a general framework applicable to multiple stacked representations.
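The two metrics named above can be illustrated with a minimal sketch. It assumes that 1PE is the percentage of valid pixels whose disparity error exceeds one pixel and that MAE is the mean absolute disparity error in pixels; the function name `disparity_metrics` and the convention that zero or NaN marks missing ground truth are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def disparity_metrics(pred, gt, valid=None, threshold=1.0):
    """Return (n-pixel error in percent, MAE in pixels) over valid pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        # Assumption: missing ground truth is encoded as 0 or NaN.
        valid = np.isfinite(gt) & (gt > 0)
    err = np.abs(pred - gt)[valid]
    if err.size == 0:
        raise ValueError("no valid ground-truth pixels")
    npe = 100.0 * float(np.mean(err > threshold))  # share of pixels off by more than `threshold`
    mae = float(np.mean(err))                      # average absolute error
    return npe, mae
```

Passing `threshold=2.0` would give a 2-pixel error under the same assumptions.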

Abstract

Event stereo matching is an emerging technique to estimate depth from neuromorphic cameras; however, events are unlikely to trigger in the absence of motion or in the presence of large, untextured regions, making the correspondence problem extremely challenging. To this end, we propose integrating a stereo event camera with a fixed-frequency active sensor -- e.g., a LiDAR -- that collects sparse depth measurements, overcoming the aforementioned limitations. These depth hints are used to hallucinate the stacks or raw input streams -- i.e., to insert fictitious events into them -- compensating for the lack of information in the absence of brightness changes. Our techniques are general, can be adapted to any structured representation used to stack events, and outperform state-of-the-art fusion methods applied to event-based stereo.
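As a rough illustration of hallucinating on the stacks, the sketch below writes one identical fictitious pattern into the left and right stacks at the two pixels that a depth hint says should match. It is a sketch under stated assumptions, not the authors' implementation: the `(C, H, W)` stack layout, the random uniform pattern, the use of 0 for "no hint", and the name `hallucinate_stacks` are all hypothetical. It relies only on the standard rectified-stereo relation that left pixel `x` corresponds to right pixel `x - d` for disparity `d`.

```python
import numpy as np

def hallucinate_stacks(left, right, hints, rng=None):
    """Write matching fictitious patterns into copies of two event stacks.

    left, right: arrays of shape (C, H, W), the stacked event representations.
    hints: array of shape (H, W) with sparse disparities; 0 means no hint (assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    left, right = left.copy(), right.copy()
    channels, _, width = left.shape
    ys, xs = np.nonzero(hints > 0)
    for y, x in zip(ys, xs):
        xr = int(round(x - hints[y, x]))  # matching column in the right view
        if 0 <= xr < width:
            # Hypothetical pattern: the same random values in both views,
            # so that the two pixels become easy to match.
            pattern = rng.uniform(-1.0, 1.0, size=channels)
            left[:, y, x] = pattern
            right[:, y, xr] = pattern
    return left, right
```

Hints whose matching column falls outside the right image are skipped here; how such cases are handled in practice is not stated in the text above.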

Paper Structure (14 sections, 2 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: LiDAR-Event Stereo Fusion with Hallucinations. In the absence of motion or brightness changes, sparse event streams lead stereo models to catastrophic failures (a). A LiDAR sensor can be used with existing strategies [poggi2019guided] to soften this problem, yet with limited impact (b), whereas our proposals are superior (c,d).
  • Figure 2: Event cameras vs LiDARs -- strengths and weaknesses. Event cameras provide rich cues at object boundaries where LiDARs cannot (cyan), yet LiDARs can measure depth where the lack of texture makes event cameras uninformative (green).
  • Figure 3: Overview of a generic event-based stereo network and our hallucination strategies. State-of-the-art event-stereo frameworks (a) pre-process raw events to obtain event stacks fed to a deep network. In case the stacks are accessible, we define the model as a gray box, otherwise as a black box. In the former case (b), we can hallucinate patterns directly on it (VSH). When dealing with a black box (c), we can hallucinate raw events that will be processed to obtain the stacks (BTH).
  • Figure 4: Overview of Back-in-Time Hallucination (BTH). To estimate disparity at $t_d$, if LiDAR data is available -- e.g., at timestamp $t_z=t_d$ (green) or $t_z=t_d-15$ (yellow) -- we can naïvely inject events of random polarities at the same timestamp $t_z$ (a). More advanced injection strategies can be used -- e.g., hallucinating multiple events back-in-time at regular intervals, starting from $t_d$ (b).
  • Figure 5: Qualitative comparison -- DSEC vs M3ED. DSEC features $640\times480$ event cameras and a 16-line LiDAR, M3ED has $1280\times720$ event cameras and a 64-line LiDAR. LiDAR scans have been dilated with a $7\times7$ kernel to ease visualization.
  • ...and 4 more figures
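
The injection strategy described in the Figure 4 caption (several fictitious events per depth hint, starting from $t_d$ and stepping back in time at regular intervals, with random polarities) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the `(t, x, y, p)` event layout, the choice to give the left and right events the same polarity, the integer timestamps, and the names `back_in_time_events` and `merge_events` are hypothetical.

```python
import numpy as np

def back_in_time_events(hints, t_d, width, n_events=3, interval=5, rng=None):
    """Build fictitious left/right events from sparse depth hints.

    hints: iterable of (x, y, disparity) in left-view pixel coordinates.
    Returns two integer arrays with columns (t, x, y, p), p in {-1, +1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    left, right = [], []
    for x, y, d in hints:
        xr = int(round(x - d))  # matching column in the right view
        if not 0 <= xr < width:
            continue  # the match falls outside the right image
        for k in range(n_events):
            t = t_d - k * interval                 # t_d, t_d - interval, ...
            p = 2 * int(rng.integers(0, 2)) - 1    # random polarity
            # Assumption: both views receive the same polarity and timestamp.
            left.append((t, x, y, p))
            right.append((t, xr, y, p))
    as_array = lambda ev: np.array(ev, dtype=np.int64).reshape(-1, 4)
    return as_array(left), as_array(right)

def merge_events(real, fake):
    """Concatenate real and fictitious events and sort them by timestamp."""
    merged = np.concatenate([real, fake], axis=0)
    return merged[np.argsort(merged[:, 0], kind="stable")]
```

Setting `n_events=1` reduces this to the naive single-event variant; the merged stream can then be passed to whatever stacking step the stereo network already uses, which is why no access to the stacks is needed.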