SAGE: Structure-Aware Generative Video Transitions between Diverse Clips

Mia Kan, Yilin Liu, Niloy Mitra

TL;DR

SAGE creates coherent transitions between diverse clips by applying structure-aware guidance to a pretrained diffusion-based inbetweening model in a zero-shot setting. It detects and matches salient line structures, propagates them along motion-aware B-spline trajectories, and conditions the diffusion model on edge maps to produce semantically consistent frames. The method combines geometric priors with generative synthesis to maintain structural integrity and motion continuity, outperforming cross-fade and state-of-the-art baselines on both quantitative metrics and user studies. The approach enables content-aware transitions without fine-tuning and opens avenues for semantic and appearance-guided enhancements.

Abstract

Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality, plausible intermediates, but they struggle to bridge diverse clips that involve large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions), a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparisons with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.

Paper Structure

This paper contains 11 sections, 10 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Artist-designed transitions. Two artist-crafted transitions illustrate the heuristics that inspire SAGE; full sequences are provided in the supplemental. (i) Structural anchoring: silhouettes and edges are aligned across clips to prevent scene collapse, as highlighted by the matching colored lines. (ii) Motion continuity: dominant flows such as camera pans are preserved to ensure fluid evolution, as indicated by the white arrows. (iii) Layered blending: foreground objects morph while backgrounds fade, reducing ghosting and clutter (not depicted here). These principles motivate our design of structure- and motion-aware generative transitions.
  • Figure 2: Method overview. Given two clips, we extract structural lines, optical flow, and foreground masks (Stage I). We match and interpolate these structures using motion-aware B-spline trajectories (Stage II), producing intermediate line sets $\{L_t\}_{t=1}^T$. These are then used to condition a pretrained generative inbetweening model (Stage III), yielding smooth and motion-aware transitions between diverse clips.
  • Figure 3: Trajectory ablations for structural guidance. (a) Input frames with computed optical flow and segmentation; (b) Linear interpolation of all matched lines across foreground and background, resulting in trajectory crossings and line mismatches when semantic structure is ignored; (c) Linear interpolation restricted to foreground lines (selected by $M_A$), yielding clearer trajectories for salient structures but still exhibiting crossover and motion inconsistency; (d) Motion-aware guidance combining the global bounding-box trajectory $\{B_t\}_{t=1}^T$ with local line trajectories $\{L_t\}_{t=1}^T$, aligning structural evolution with scene/camera motion, as indicated in (a), while reducing trajectory crossovers.
  • Figure 4: Result gallery. Qualitative results on diverse video clips, showcasing the model's performance on complex transitions in scene scale (local-global), object category, and motion direction. Full videos are available on our supplementary webpage.
  • Figure 5: Comparisons. Qualitative comparison with baseline methods, demonstrating that SAGE generates more plausible video transitions by maintaining consistency in motion, foreground objects, and background scenery. Full videos are available on our supplementary webpage.
  • ...and 1 more figure
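
The captions for Figures 2 and 3 contrast linear interpolation of matched lines with motion-aware B-spline trajectories. The following is a minimal sketch of that contrast, not the authors' implementation. It assumes each matched line is stored as two endpoints, and it uses a quadratic clamped B-spline with three control points (equivalent to a quadratic Bézier curve). The function names and the `motion_offset` parameter, which stands in for a displacement derived from optical flow, are hypothetical and do not come from the paper.

```python
import numpy as np


def linear_line_trajectory(line_a, line_b, num_frames):
    """Linearly interpolate one matched line from clip A to clip B.

    line_a, line_b: arrays of shape (2, 2), two (x, y) endpoints each.
    Returns an array of shape (num_frames, 2, 2).
    """
    line_a = np.asarray(line_a, dtype=float)
    line_b = np.asarray(line_b, dtype=float)
    ts = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    return (1.0 - ts) * line_a + ts * line_b


def bspline_line_trajectory(line_a, line_b, motion_offset, num_frames):
    """Move a matched line along a quadratic clamped B-spline.

    The middle control point is the midpoint of the two lines shifted by
    `motion_offset` (hypothetical: e.g. an average flow vector), so the
    path bends in the direction of the dominant motion.
    """
    line_a = np.asarray(line_a, dtype=float)
    line_b = np.asarray(line_b, dtype=float)
    offset = np.asarray(motion_offset, dtype=float)
    ctrl = 0.5 * (line_a + line_b) + offset  # middle control point
    ts = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    # Quadratic basis functions for the knot vector [0, 0, 0, 1, 1, 1].
    return (1.0 - ts) ** 2 * line_a + 2.0 * (1.0 - ts) * ts * ctrl + ts ** 2 * line_b
```

Both functions reproduce the input lines at the first and last frame. With a zero `motion_offset` the spline reduces to the linear case, so the offset is the only term that bends the trajectory toward the scene or camera motion. In a full pipeline the per-frame lines would then be rasterized into the maps that condition the inbetweening model, a step this sketch omits.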