VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion

Linfeng Tang, Yeda Wang, Meiqi Gong, Zizhuo Li, Yuxin Deng, Xunpeng Yi, Chunyu Li, Han Xu, Hao Zhang, Jiayi Ma

TL;DR

VideoFusion is a multi-modal video fusion model that exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from multi-modal inputs. It outperforms existing image-oriented fusion paradigms on video sequences, mitigating temporal inconsistency and interference.

Abstract

Compared to images, videos better reflect real-world acquisition and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos, for two reasons: the scarcity of large-scale multi-sensor video datasets, which limits research in video fusion, and the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. To this end, we first construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible videos comprising 153,797 frames, bridging the data gap. Second, we propose VideoFusion, a multi-modal video fusion model that exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms on video sequences, effectively mitigating temporal inconsistency and interference. Project and M3SVD: https://github.com/Linfeng-Tang/VideoFusion.
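The idea of aggregating forward and backward temporal context can be illustrated with a toy example. The sketch below is not the paper's bi-temporal co-attention mechanism, whose details the abstract does not give. It only assumes that each frame is represented by a small feature vector and that a frame attends to its previous, current, and next frames through scaled dot-product similarity. The function name `bi_temporal_aggregate` and all other names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bi_temporal_aggregate(frames):
    """Toy forward-backward temporal aggregation (illustrative assumption only).

    frames: list of non-empty per-frame feature vectors (lists of floats).
    Each output frame is an attention-weighted mix of the previous, current,
    and next frame features; boundary frames use only the neighbours they have.
    """
    last = len(frames) - 1
    fused_frames = []
    for t, query in enumerate(frames):
        # Indices of the backward neighbour, the frame itself, and the forward neighbour.
        neighbours = sorted({max(t - 1, 0), t, min(t + 1, last)})
        scale = math.sqrt(len(query))
        # Similarity of the current frame to each neighbour, turned into weights.
        weights = softmax([dot(query, frames[i]) / scale for i in neighbours])
        fused = [
            sum(w * frames[i][d] for w, i in zip(weights, neighbours))
            for d in range(len(query))
        ]
        fused_frames.append(fused)
    return fused_frames
```

Because the weights at each time step sum to one, every fused feature is a convex combination of neighbouring frames, so a static sequence passes through unchanged while an outlier frame is pulled toward its neighbours. A real video fusion network would operate on learned spatial feature maps for both modalities rather than on hand-made vectors.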

Paper Structure

This paper contains 19 sections, 11 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Image-oriented fusion vs. video fusion.
  • Figure 2: Schematic of image calibration and registration.
  • Figure 3: Visualization of various scenarios in M3SVD dataset.
  • Figure 4: The overall framework of our spatio-temporal collaborative video fusion network.
  • Figure 5: Qualitative comparison results on M3SVD and HDO datasets under degraded scenarios.
  • ...and 5 more figures