SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
Enrico Pallotta, Sina Mokhtarzadeh Azar, Shuai Li, Olga Zatsarynna, Juergen Gall
TL;DR
SyncVP addresses the limitations of RGB-only video prediction by jointly modeling RGB and depth with pre-trained, modality-specific diffusion models. It introduces a split spatio-temporal cross-attention (STCA) module for cross-modal information exchange and applies the same forward-diffusion noise to both modalities, which improves convergence and coherence. A cross-modality guidance training strategy makes generation robust to partial conditioning, so predictions remain possible when one modality is missing. On Cityscapes and BAIR, SyncVP achieves state-of-the-art FVD and remains effective when depth is unavailable. It also generalizes to other modalities, such as semantic maps (SYNTHIA) and climate data (ERA5-Land).
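The shared-noise idea can be illustrated with a minimal sketch. It assumes both modalities are represented as latents of the same size (flat lists here) and uses a simplified cosine-shaped noise schedule chosen only for illustration. The function names `cosine_alpha_bar` and `add_shared_noise` are hypothetical and are not taken from the SyncVP code. A real implementation would operate on tensors rather than Python lists.

```python
import math
import random


def cosine_alpha_bar(t, num_steps):
    """Cumulative signal level at step t (simplified cosine schedule, assumed)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def add_shared_noise(rgb_latent, depth_latent, t, num_steps, rng):
    """Noise both modalities with the SAME Gaussian sample.

    Both latents are flat lists of equal length (an assumption of this sketch).
    Returns (noisy_rgb, noisy_depth, eps).
    """
    if len(rgb_latent) != len(depth_latent):
        raise ValueError("this sketch assumes both latents have the same size")
    a_bar = cosine_alpha_bar(t, num_steps)
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    # The noise is drawn once and reused for both modalities.
    eps = [rng.gauss(0.0, 1.0) for _ in rgb_latent]
    noisy_rgb = [signal * x + noise * e for x, e in zip(rgb_latent, eps)]
    noisy_depth = [signal * x + noise * e for x, e in zip(depth_latent, eps)]
    return noisy_rgb, noisy_depth, eps


rng = random.Random(0)
rgb = [0.5, -0.2, 0.1, 0.9]      # toy RGB latent
depth = [0.3, 0.3, -0.7, 0.0]    # toy depth latent
noisy_rgb, noisy_depth, eps = add_shared_noise(rgb, depth, t=500, num_steps=1000, rng=rng)
```

Because the same `eps` enters both noisy latents, their difference depends only on the clean latents scaled by the signal level. This is one way to see why sharing noise keeps the two modalities aligned at every diffusion step.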
Abstract
Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.
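To make the cross-attention idea concrete, the sketch below shows one way a spatio-temporal cross-attention between two modalities could be split into a spatial step and a temporal step. It is an illustration under stated assumptions, not the module described in the paper. It assumes features are indexed as `[frame][position][channel]`, uses identity projections instead of learned query/key/value weights, applies the two steps sequentially, and adds a residual connection. The names `attend` and `split_cross_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def split_cross_attention(a, b):
    """Update modality `a` with information from modality `b`.

    a, b: nested lists indexed [frame][position][channel] with equal shapes.
    """
    num_frames, num_positions = len(a), len(a[0])
    # Spatial step: each token of `a` attends to the tokens of `b` in its own frame.
    spatial = [[attend(a[t][s], b[t], b[t]) for s in range(num_positions)]
               for t in range(num_frames)]
    out = []
    for t in range(num_frames):
        row = []
        for s in range(num_positions):
            # Temporal step: attend to `b` at the same position across all frames.
            column = [b[u][s] for u in range(num_frames)]
            temporal = attend(spatial[t][s], column, column)
            # Residual connection keeps the original features of `a`.
            row.append([x + y for x, y in zip(a[t][s], temporal)])
        out.append(row)
    return out


# Toy input: 2 frames, 2 positions, 2 channels; `b` is constant everywhere.
a = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
v = [0.2, -0.4]
b = [[v[:] for _ in range(2)] for _ in range(2)]
out = split_cross_attention(a, b)
```

With T frames and S positions per frame, each query in this split form compares against S + T keys rather than the T * S keys of joint attention over all tokens, which is the usual motivation for factorizing attention this way.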
