Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis

Vasco Ramos, Yonatan Bitton, Michal Yarom, Idan Szpektor, Joao Magalhaes

TL;DR

CoSeD tackles non-linear, multi-scene instructional video synthesis by grounding each new scene in the most coherent prior context through a contrastive sequential diffusion framework. It combines sequential language conditioning, denoising conditioning, and contrastive multi-scene selection, which uses CLIP-based text and vision embeddings to pick the most coherent earlier scene in the task sequence. The approach improves semantic and sequence consistency, matching or exceeding strong baselines on automatic CLIP metrics and comparing favorably in human evaluations. The method remains compact and adaptable, enabling efficient training and fine-tuning across domains, and it supports the non-linear task structures found in real-world instructional sequences.

Abstract

Generated video scenes for action-centric sequence descriptions, such as recipe instructions and do-it-yourself projects, often include non-linear patterns, where the next video may need to be visually consistent not with the immediately preceding video but with earlier ones. Current multi-scene video synthesis approaches fail to meet these consistency requirements. To address this, we propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. The result is a multi-scene video that is grounded in the scene descriptions and coherent w.r.t. the scenes that require visual consistency. Experiments with action-centered data from the real world demonstrate the practicality and improved consistency of our model compared to previous work.
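The selection step described above, choosing which previously generated scene should condition the next one, can be illustrated with a minimal sketch. It assumes each prior scene and the next step description have already been embedded into a shared space (for example by a CLIP-style encoder) and that coherence is scored by cosine similarity. The function names `cosine` and `select_conditioning_scene` and the toy vectors are hypothetical; the paper's learned selection module may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_conditioning_scene(step_embedding, prior_scene_embeddings):
    """Return the index of the prior scene most similar to the next step.

    Hypothetical helper: returns None when no scene has been generated yet,
    in which case the next scene would be conditioned on text alone.
    """
    if not prior_scene_embeddings:
        return None
    scores = [cosine(step_embedding, e) for e in prior_scene_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-d embeddings standing in for real encoder outputs (assumption).
prior_scenes = [
    [1.0, 0.0, 0.0],  # scene 0
    [0.0, 1.0, 0.0],  # scene 1
    [0.7, 0.7, 0.0],  # scene 2 (the immediately preceding scene)
]
next_step = [0.9, 0.1, 0.0]

best = select_conditioning_scene(next_step, prior_scenes)
print(best)
```

In this toy case the chosen scene is an earlier one rather than the immediately preceding scene, which is the non-linear pattern the abstract describes.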

Paper Structure

This paper contains 43 sections, 9 equations, 18 figures, and 4 tables.

Figures (18)

  • Figure 1: CoSeD is grounded on an input sequence of step actions to synthesize non-linear, multi-scene instructional videos.
  • Figure 2: The proposed contrastive denoising diffusion learning architecture. The contrastive learning component captures the temporal relationships between conditioned scenes and preceding scenes, ensuring coherent transitions throughout the video.
  • Figure 3: Multi-scene V&L contrastive learning uses multiple sequences. This multi-sequence information provides both positive and negative pairs, helping the model learn the best next scene according to the ground-truth scenes.
  • Figure 4: Example of an illustration for the recipe domain.
  • Figure 5: Example of an illustration for the DIY domain.
  • ...and 13 more figures
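
The Figure 3 caption mentions positive and negative pairs drawn from multiple sequences. As a rough illustration of how such pairs can drive learning, the sketch below computes a generic InfoNCE-style loss over similarity scores for one candidate set. This is an assumed, standard contrastive formulation and not necessarily the paper's objective; the function name `info_nce` and the temperature value are hypothetical.

```python
import math


def info_nce(scores, positive_index, temperature=0.07):
    """Cross-entropy over similarity scores with one positive and the rest negatives.

    scores: similarity of the anchor to each candidate scene.
    positive_index: position of the ground-truth (positive) candidate.
    """
    logits = [s / temperature for s in scores]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]


# Toy similarities: candidate 0 is the ground-truth scene (assumption).
scores = [0.9, 0.1, 0.2]
loss_when_aligned = info_nce(scores, positive_index=0)
loss_when_misaligned = info_nce(scores, positive_index=1)
print(loss_when_aligned < loss_when_misaligned)
```

The loss is small when the positive candidate already has the highest similarity and grows when a negative outranks it, which is what pushes a model toward selecting the ground-truth scene.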