EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes

Xiaoshan Wu, Yifei Yu, Xiaoyang Lyu, Yihua Huang, Bo Wang, Baoheng Zhang, Zhongrui Wang, Xiaojuan Qi

TL;DR

This work tackles robust 3D geometry estimation from video in dynamic, low-light environments by augmenting RGB-based pointmaps with asynchronous event streams. It introduces EAG3R, which combines a Retinex-inspired image enhancer, an SNR-guided fusion module, and a Swin Transformer-based event adapter to fuse RGB and event features, and adds an event-based photometric consistency loss to the global optimization. The approach shows strong zero-shot nighttime performance on MVSEC across monocular depth estimation, camera pose tracking, and dynamic 4D reconstruction, outperforming RGB-only baselines with modest computational overhead. Overall, EAG3R highlights the value of multimodal event-RGB fusion for reliable 3D perception under challenging conditions.

Abstract

Robust 3D geometry estimation from videos is critical for applications such as autonomous navigation, SLAM, and 3D scene reconstruction. Recent methods like DUSt3R demonstrate that regressing dense pointmaps from image pairs enables accurate and efficient pose-free reconstruction. However, existing RGB-only approaches struggle under real-world conditions involving dynamic objects and extreme illumination, due to the inherent limitations of conventional cameras. In this paper, we propose EAG3R, a novel geometry estimation framework that augments pointmap-based reconstruction with asynchronous event streams. Built upon the MonST3R backbone, EAG3R introduces two key innovations: (1) a Retinex-inspired image enhancement module and a lightweight event adapter with an SNR-aware fusion mechanism that adaptively combines RGB and event features based on local reliability; and (2) a novel event-based photometric consistency loss that reinforces spatiotemporal coherence during global optimization. Our method enables robust geometry estimation in challenging dynamic low-light scenes without requiring retraining on night-time data. Extensive experiments demonstrate that EAG3R significantly outperforms state-of-the-art RGB-only baselines across monocular depth estimation, camera pose tracking, and dynamic reconstruction tasks.
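
To make "adaptively combines RGB and event features based on local reliability" concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a simple SNR proxy (a box-filtered image divided by the residual noise) and a convex per-pixel blend; the paper's fusion operates on learned features. The names `box_blur`, `snr_map`, and `fuse` are hypothetical.

```python
import numpy as np

def box_blur(img, k=3):
    """Mean filter with edge padding; a stand-in for a real denoiser (assumption)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def snr_map(gray, eps=1e-6):
    """Per-pixel SNR proxy squashed to [0, 1): smoothed signal over residual noise."""
    smooth = box_blur(gray)
    noise = np.abs(gray - smooth)
    snr = smooth / (noise + eps)
    return snr / (1.0 + snr)

def fuse(rgb_feat, event_feat, conf):
    """Convex blend: rely on RGB features where conf is high, on event features elsewhere."""
    w = conf[..., None]  # broadcast the H x W confidence over the channel axis
    return w * rgb_feat + (1.0 - w) * event_feat

# Toy usage: a completely dark frame yields zero confidence,
# so the fused features fall back to the event branch.
dark = np.zeros((4, 4))
rgb_feat = np.ones((4, 4, 2))
event_feat = np.full((4, 4, 2), 3.0)
fused = fuse(rgb_feat, event_feat, snr_map(dark))
```

The blend is deliberately the simplest reliability-weighted rule; it only illustrates how a confidence map can shift weight between the two modalities region by region.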

Paper Structure

This paper contains 51 sections, 11 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: EAG3R pipeline for event-augmented dynamic 3D reconstruction. EAG3R processes a low-light video and its corresponding event stream within a temporal window, extracting pairwise pointmaps for each frame pair. These pointmaps are jointly optimized under alignment, flow, smoothness, and event-based consistency losses to recover a global dynamic point cloud and per-frame camera poses and intrinsics $\{X, P, K\}$. This unified representation enables efficient downstream tasks such as depth estimation and camera pose estimation under challenging lighting conditions.
  • Figure 2: EAG3R network. Left: The DUSt3R (MonST3R) architecture, with reference and source views processed via a ViT encoder-decoder structure. Middle: Our method (only the upstream branch for the reference image is shown), which includes a lightweight event encoder and fuses event and image features with cross-attention. Right: The Retinex-based enhancement module estimates an illumination map and an SNR confidence map to guide adaptive fusion.
  • Figure 3: Event-based photometric consistency loss. Harris corners are detected on the input image to define salient patches. Observed brightness increments are computed by integrating event polarities, while predicted increments are synthesized from image gradients and motion. The loss $\mathcal{L}_{\text{event}}$ measures their alignment.
  • Figure 4: Comparison of estimated camera trajectories. The predicted trajectories (solid blue) from DUSt3R, MonST3R, and EAG3R are evaluated against the ground truth (dashed gray). EAG3R's trajectory aligns more closely with the ground truth than those of the two baselines.
  • Figure A.1: Qualitative comparison on dynamic scenes. Our method reconstructs consistent 3D geometry even when a fast-moving vehicle passes through the scene. RGB-only methods fail to capture this motion reliably.
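
The comparison described in the Figure 3 caption (observed brightness increments from integrated event polarities versus increments predicted from image gradients and motion) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's code: it skips Harris corner selection and treats the whole array as one patch, uses a linearized brightness-constancy model, and measures alignment as the squared distance between unit-normalized increments. The contrast threshold value and the function names are hypothetical.

```python
import numpy as np

def observed_increment(events, shape, contrast=0.2):
    """Integrate events per pixel; each event is (x, y, polarity), polarity in {+1, -1}."""
    d_l = np.zeros(shape, dtype=np.float64)
    for x, y, p in events:
        d_l[y, x] += contrast * p
    return d_l

def predicted_increment(log_img, flow, dt):
    """Linearized brightness constancy: dL is approximately -(grad L . v) * dt."""
    gy, gx = np.gradient(log_img)  # derivatives along rows (y), then columns (x)
    return -(gx * flow[..., 0] + gy * flow[..., 1]) * dt

def event_loss(obs, pred, eps=1e-9):
    """Squared distance between unit-normalized increments (assumed form of the loss)."""
    a = obs / (np.linalg.norm(obs) + eps)
    b = pred / (np.linalg.norm(pred) + eps)
    return float(np.sum((a - b) ** 2))

# Toy patch: brightness ramps up along x and the scene moves in +x,
# so every pixel gets darker and should fire negative events.
h, w = 5, 5
log_img = np.tile(np.arange(w, dtype=np.float64), (h, 1))
flow = np.zeros((h, w, 2))
flow[..., 0] = 1.0
pred = predicted_increment(log_img, flow, dt=0.01)

consistent = [(x, y, -1) for y in range(h) for x in range(w)]
flipped = [(x, y, +1) for y in range(h) for x in range(w)]
loss_good = event_loss(observed_increment(consistent, (h, w)), pred)
loss_bad = event_loss(observed_increment(flipped, (h, w)), pred)
```

Normalizing both increments removes the dependence on the unknown contrast threshold and on the magnitude of the motion, so the loss responds only to whether the spatial pattern of brightness change agrees. In the toy patch, `loss_good` is close to zero and `loss_bad` is close to its maximum of 4.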