SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
Mia Kan, Yilin Liu, Niloy Mitra
TL;DR
SAGE tackles the problem of creating coherent transitions between diverse clips by applying structure-aware guidance to a pretrained diffusion-based inbetweening model in a zero-shot setting. It detects and matches salient line structures, propagates them along motion-aware B-spline trajectories, and conditions the diffusion model on the resulting edge maps to produce semantically consistent frames. By combining geometric priors with generative synthesis, the method maintains structural integrity and motion continuity, and it outperforms cross-fade and state-of-the-art baselines on both quantitative metrics and user studies. The approach enables content-aware transitions without fine-tuning and opens avenues for semantic and appearance-guided enhancements.
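As a rough illustration of the propagation step, the sketch below moves the endpoints of already-matched line segments along a clamped cubic B-spline (with four control points this is a cubic Bezier curve) and rasterizes each intermediate set of segments into an edge map. It is not the authors' implementation: the function names, the per-endpoint velocity inputs `vel_a` and `vel_b`, and the choice of a single cubic segment per endpoint are assumptions made for this example, and line detection and matching are taken as given.

```python
import numpy as np

def propagate_lines(lines_a, lines_b, vel_a, vel_b, num_frames):
    """Interpolate matched segments along cubic B-spline (Bezier) trajectories.

    lines_a, lines_b: (N, 2, 2) matched segments, endpoints as (x, y).
    vel_a, vel_b:     (N, 2, 2) hypothetical endpoint velocities at the end of
                      clip A and the start of clip B, in pixels per unit of
                      normalized transition time.
    Returns an array of shape (num_frames, N, 2, 2).
    """
    p0 = np.asarray(lines_a, dtype=float)
    p3 = np.asarray(lines_b, dtype=float)
    # Inner control points chosen so the curve's end tangents equal the velocities.
    p1 = p0 + np.asarray(vel_a, dtype=float) / 3.0
    p2 = p3 - np.asarray(vel_b, dtype=float) / 3.0
    # Interior time steps only; t = 0 and t = 1 are the existing boundary frames.
    t = np.linspace(0.0, 1.0, num_frames + 2)[1:-1].reshape(-1, 1, 1, 1)
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def rasterize(segments, height, width, samples=256):
    """Draw segments into a binary edge map by dense point sampling."""
    edge = np.zeros((height, width), dtype=np.uint8)
    for (x0, y0), (x1, y1) in segments:
        xs = np.rint(np.linspace(x0, x1, samples)).astype(int)
        ys = np.rint(np.linspace(y0, y1, samples)).astype(int)
        keep = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        edge[ys[keep], xs[keep]] = 255
    return edge

# Toy example: one horizontal line moving from y = 0 to y = 10 with zero boundary motion.
lines_a = [[[0, 0], [10, 0]]]
lines_b = [[[0, 10], [10, 10]]]
zero = np.zeros((1, 2, 2))
frames = propagate_lines(lines_a, lines_b, zero, zero, num_frames=3)
edge_maps = [rasterize(f, 16, 16) for f in frames]
```

In a full pipeline, each edge map would serve as the per-frame structural condition for the inbetweening model; how that conditioning is wired in depends on the model and is not shown here.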
Abstract
Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality, plausible intermediates, but they struggle to bridge diverse clips that involve large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions), a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparisons with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and in user studies when producing transitions between diverse clips. Code will be released upon acceptance.
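For reference, the linear-blending (cross-fade) baseline mentioned above amounts to a per-pixel weighted average of the two boundary frames. The minimal sketch below, with a hypothetical function name `cross_fade`, shows why structures that do not align between the clips appear as ghosted double images rather than moving coherently.

```python
import numpy as np

def cross_fade(frame_a, frame_b, num_frames):
    """Return num_frames intermediate frames by per-pixel linear blending."""
    a = np.asarray(frame_a, dtype=np.float32)
    b = np.asarray(frame_b, dtype=np.float32)
    # Interior blend weights only; 0 and 1 correspond to the boundary frames.
    weights = np.linspace(0.0, 1.0, num_frames + 2)[1:-1]
    return [(1.0 - w) * a + w * b for w in weights]

# Toy example: blending a black frame into a uniform gray frame.
black = np.zeros((4, 4, 3), dtype=np.uint8)
gray = np.full((4, 4, 3), 100, dtype=np.uint8)
blended = cross_fade(black, gray, num_frames=3)
```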
