$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space

Ruishan Guo, Ciyu Ruan, Haoyang Wang, Zihang Gong, Jingao Xu, Xinlei Chen

Abstract

Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.
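
As a rough illustration of what "estimating modality reliability and emphasizing stable cues" can mean, the sketch below turns one scalar reliability score per modality into softmax weights and takes a weighted sum of feature vectors. This is a simplified assumption, not the paper's module, which the abstract does not specify at this level of detail. The names `softmax` and `fuse_features`, the score values, and the use of a single scalar per modality are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, reliability_logits):
    """Weighted sum of per-modality feature vectors.

    features: dict mapping modality name -> list of floats (equal lengths).
    reliability_logits: dict mapping modality name -> scalar score
        (hypothetical; higher means the modality is trusted more).
    Returns (fused_vector, weights_by_modality).
    """
    names = sorted(features)
    weights = softmax([reliability_logits[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))
```

With equal scores the result is the plain average of the modalities. Lowering one modality's score (for example, the image under poor exposure) shifts the fused vector toward the remaining modalities.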

Paper Structure (27 sections, 20 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Fusion paradigms for multimodal perception. (a) RPEFlow \cite{wan2023rpeflow}: stage-wise fusion across 2D/3D spaces. (b) CMX \cite{zhang2023cmx}: image-like encodings with pairwise rectification/attention at each backbone stage. (c) VisMoFlow \cite{zhou2024bring}: separate luminance/structure/correlation spaces with dedicated modules. (d) Ours: preserves native domains and learns an Event Edge Space, a shared edge-centric latent space guided by a frozen event teacher, that aligns image, LiDAR, and events before fusion.
  • Figure 2: Overview of $x^2$-Fusion. Given images, events, and LiDAR, we first pretrain an Event Edge Encoder to distill motion-aware edge features. We then freeze the encoder and use its embeddings as edge prototypes to symmetrically regularize channel-wise representations across modalities, aligning them into a shared Event Edge Space (Sec. \ref{sec:event_edge_space}). Within this space, Reliability-aware Adaptive Fusion estimates global and local reliability and fuses modalities via a cross-attention block to produce 2D/3D features (Sec. \ref{sec:event_aware_adaption_fusion}). Finally, Cross-dimension Contrast Learning enforces inter-frame coherence and 2D–3D consistency, and the task heads output optical and scene flow (Sec. \ref{sec:CCL}).
  • Figure 3: Our proposed event edge encoder pretraining learns explicit, high-fidelity motion-aware edge representations by predicting edge strength from voxelized event streams.
  • Figure 4: The proposed reliability-aware adaptive fusion module adaptively integrates image, LiDAR, and event features through hierarchical reliability weighting and cross-modal attention.
  • Figure 5: Visual comparison of optical flow on the EKubric and DSEC datasets. $x^2$-Fusion achieves state-of-the-art performance across various exposure degradation scenarios, with clearer motion boundaries and finer details. Please zoom in for details.
  • ...and 3 more figures
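
The Figure 3 caption refers to voxelized event streams without stating the voxelization scheme. As a minimal sketch, assuming the common approach of accumulating signed event polarities into a fixed number of temporal bins with linear interpolation between neighboring bins, the input to such an encoder could be built as follows. The function name `voxelize_events`, the `(x, y, t, polarity)` tuple layout, and the bin count are illustrative assumptions, not details taken from the paper.

```python
def voxelize_events(events, num_bins, height, width):
    """Accumulate events into a [num_bins][height][width] grid.

    events: iterable of (x, y, t, polarity) tuples with integer pixel
        coordinates; polarity > 0 counts as +1, otherwise -1.
    Each event's polarity is split between the two nearest temporal bins
    in proportion to its normalized timestamp (linear interpolation).
    """
    events = list(events)
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid

    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = (t_end - t_start) or 1.0  # avoid division by zero

    for x, y, t, polarity in events:
        if not (0 <= x < width and 0 <= y < height):
            continue  # skip events outside the sensor area
        # Map the timestamp to a continuous bin index in [0, num_bins - 1].
        t_norm = (t - t_start) / span * (num_bins - 1)
        lower = int(t_norm)
        frac = t_norm - lower
        sign = 1.0 if polarity > 0 else -1.0
        grid[lower][y][x] += sign * (1.0 - frac)
        if lower + 1 < num_bins:
            grid[lower + 1][y][x] += sign * frac
    return grid
```

An event that falls exactly on a bin center contributes its full polarity to that bin, while an event halfway between two bins contributes half to each, which keeps fine temporal information that hard binning would discard.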