Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis
Vasco Ramos, Yonatan Bitton, Michal Yarom, Idan Szpektor, Joao Magalhaes
TL;DR
CoSeD tackles non-linear, multi-scene instructional video synthesis by grounding each new scene in the most coherent prior scene through a contrastive sequential diffusion framework. It combines sequential language conditioning, denoising conditioning, and a contrastive multi-scene selection step that uses CLIP-based text and vision embeddings to pick which earlier scene in the task sequence the next scene should stay consistent with. The approach improves semantic and sequence consistency, with competitive or superior scores on automatic CLIP metrics and favorable human evaluations compared to strong baselines. The method remains compact and adaptable, enabling efficient training and fine-tuning across domains, and it supports non-linear task structures that reflect real-world instructional sequences.
Abstract
Generated video scenes for action-centric sequence descriptions, such as recipe instructions and do-it-yourself projects, often follow non-linear patterns, where the next video may need to be visually consistent not with the immediately preceding video but with earlier ones. Current multi-scene video synthesis approaches fail to meet these consistency requirements. To address this, we propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. The result is a multi-scene video that is grounded in the scene descriptions and coherent with respect to the scenes that require visual consistency. Experiments with real-world, action-centric data demonstrate the practicality and improved consistency of our model compared to previous work.
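
The selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each previously generated scene and the next step description have already been mapped to embedding vectors (for example by a CLIP-style encoder), and it uses plain cosine similarity as a stand-in for the learned contrastive score. The names `cosine_similarity` and `select_prior_scene`, and the toy 3-dimensional vectors, are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_prior_scene(step_embedding, prior_scene_embeddings):
    """Pick the earlier scene whose embedding best matches the next step.

    Hypothetical helper: cosine similarity stands in for the contrastive
    score; the paper's actual scoring function may differ.
    Returns (index of the best prior scene, list of all scores).
    """
    scores = [cosine_similarity(step_embedding, e) for e in prior_scene_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy vectors standing in for real embeddings (assumption, not real CLIP output).
prior_scenes = [
    [1.0, 0.0, 0.0],  # scene 0, e.g. "chop the onions"
    [0.0, 1.0, 0.0],  # scene 1, e.g. "boil the pasta"
    [0.0, 0.0, 1.0],  # scene 2, e.g. "grate the cheese"
]
next_step = [0.9, 0.1, 0.0]  # e.g. "fry the chopped onions"

best_index, scores = select_prior_scene(next_step, prior_scenes)
```

In this toy case the next step is closest to scene 0 rather than to the immediately preceding scene 2, which is the non-linear dependency the method is meant to capture. The selected scene would then condition the denoising of the next scene.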
