FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion

Pihai Sun, Junjun Jiang, Yuanqi Yao, Youyu Chen, Wenbo Zhao, Kui Jiang, Xianming Liu

TL;DR

FUSE tackles two core challenges in image–event depth estimation: scarce cross-modal supervision and frequency-domain mismatches. It introduces PST to transfer image-depth priors to the image–event domain using a two-stage, parameter-efficient adaptation with LoRA adapters, and FreDFuse to decouple and fuse high-frequency event cues with low-frequency image structure through a Gaussian–Laplacian pyramid and cross-attention. The approach achieves state-of-the-art performance on MVSEC and DENSE, with strong zero-shot robustness under challenging lighting and motion conditions, while significantly reducing trainable parameters compared to full fine-tuning. This work enables robust, scalable depth perception in dynamic environments and points to future improvements in native asynchronous processing and model compression for real-time deployment.
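The parameter saving from LoRA-style adaptation can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `LoRALinear`, the `rank` and `alpha` parameters, and the initialization scheme are assumptions made for illustration. A frozen weight matrix W is augmented with a trainable low-rank update (alpha / rank) * B * A, with B starting at zero so the adapted layer initially reproduces the pre-trained one.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer with a frozen weight and a low-rank update."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen pre-trained weight, shape d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        # A is small and random, B is zero: the update B @ A starts at zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A and B would be updated; the base weight stays frozen.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a d_out x d_in layer, the adapter trains rank * (d_in + d_out) values instead of d_in * d_out, which is where the reduction relative to full fine-tuning comes from when the rank is small.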

Abstract

Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components. The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs.Rel on the MVSEC and DENSE datasets, respectively. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE
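For readers unfamiliar with the metric, Abs.Rel (absolute relative error) is commonly defined as the mean of |d_pred - d_gt| / d_gt over pixels with valid ground truth. The sketch below follows that common definition. The function names, the "positive ground truth means valid" convention, and the reading of the reported gains as relative reductions are assumptions, not details stated in the abstract.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with positive ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_improvement(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline
```

Under this reading, a drop in Abs.Rel from a hypothetical 0.50 to 0.43 would count as a 14% improvement; these two values are illustrative and are not taken from the paper.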

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Demonstration of FUSE's superior performance in challenging conditions. (a) and (c) show the input image and event data, with blue/red in (c) indicating brightness decrease/increase. When image data fails due to low light and blur, event data provides complementary dynamic information. Our method (d) leverages multimodal synergy, recovering the traffic light missed by the state-of-the-art image depth model Depth Anything V2, as highlighted in the orange box.
  • Figure 2: Overview of the FUSE framework. FUSE integrates an image encoder, event encoder, Frequency-Decoupled Fusion module (FreDFuse), and depth decoder. The image encoder and depth decoder are initialized with a pre-trained MDE model. A two-stage knowledge transfer strategy fine-tunes the event encoder and FreDFuse. In Stage 1, the event encoder is initialized with the image encoder’s weights, and only the LoRA matrices and Patch Embed are fine-tuned using clean event data. In Stage 2, randomly degraded image-event pairs are used, and only FreDFuse is fine-tuned. The MDE model supervises both stages with clean image data in the output and latent spaces.
  • Figure 3: Overview of our FreDFuse. FreDFuse decouples image features $\mathbf{F}_I$ and event features $\mathbf{F}_E$ into high- and low-frequency components using a Gaussian-Laplacian pyramid. Multi-scale features are fused top-down with $1 \times 1$ group convolutions and channel shuffle. Fusion in the high-frequency branch is event-driven, while the low-frequency branch is image-driven. The high- and low-frequency components are combined through addition and processed by LayerNorm.
  • Figure 4: Qualitative analysis of the MVSEC dataset, outdoor_day1 scene. (a) and (b) show the input image and event data; (c) depicts the image-event joint estimation by HMNet; (d) shows the event-only estimation by HMNet; (e) presents our proposed FUSE; (f) shows the depth ground truth.
  • Figure 5: Qualitative analysis of the MVSEC dataset, outdoor_night1 scene. (a) and (b) show the input image and event data; (c) depicts the image-event joint estimation by HMNet; (d) shows the event-only estimation by HMNet; (e) presents our proposed FUSE; (f) shows the depth ground truth.
  • ...and 1 more figure
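The frequency decoupling described in the Figure 3 caption can be illustrated with a single-level, one-dimensional sketch. This is a simplification under stated assumptions: the real module operates on multi-scale 2D feature maps and fuses them with learned layers, whereas here a fixed [1, 2, 1] / 4 kernel stands in for the Gaussian filter, and the hypothetical `naive_fuse` simply adds the image's low band to the event's high band instead of performing learned fusion.

```python
def blur(signal):
    """Low-pass filter with a [1, 2, 1] / 4 kernel, replicating edge samples."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        out.append((left + 2.0 * signal[i] + right) / 4.0)
    return out


def decouple(signal):
    """Split a signal into a low band (blurred) and a high band (residual)."""
    low = blur(signal)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def naive_fuse(image_feat, event_feat):
    """Stand-in fusion: image supplies the low band, events the high band."""
    image_low, _ = decouple(image_feat)
    _, event_high = decouple(event_feat)
    return [l + h for l, h in zip(image_low, event_high)]
```

Because the high band is defined as the residual of the low-pass filter, adding the two bands recovers the input, which is the property that makes a Laplacian-style decomposition lossless. A flat signal has an all-zero high band, while a step edge concentrates its high-band energy around the transition, matching the intuition that events carry edge information and images carry smooth structure.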