Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation

Lingyu Liu, Yaxiong Wang, Li Zhu, Zhedong Zheng

Abstract

Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.
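
To make the training mechanism concrete, the following is a minimal PyTorch-style sketch of the bidirectional objective described above. It is a sketch under assumptions, not the paper's released code: `DirectionalTokens`, the `denoiser` callable, and the toy linear noising schedule are illustrative stand-ins for the shared video-diffusion backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalTokens(nn.Module):
    """Two learnable embeddings that condition a shared backbone on
    temporal orientation. Names and shapes are illustrative."""
    def __init__(self, dim: int):
        super().__init__()
        # Row 0 conditions forward generation, row 1 backward.
        self.tokens = nn.Parameter(torch.randn(2, dim) * 0.02)

    def forward(self, reverse: bool) -> torch.Tensor:
        return self.tokens[1] if reverse else self.tokens[0]


def bidirectional_loss(denoiser, dir_tokens, latents):
    """One training step of the cycle-consistent objective (sketch).

    latents: (B, T, D) latent frames of one ground-truth clip. The clip
    supervises two samples: the original order and its time reversal
    (equivalently, swapped endpoints), each routed through the same
    backbone under its own directional token.
    """
    total = 0.0
    for reverse in (False, True):
        clip = torch.flip(latents, dims=[1]) if reverse else latents
        noise = torch.randn_like(clip)
        # Toy linear noising schedule, standing in for the backbone's
        # actual diffusion schedule.
        t = torch.rand(clip.size(0), device=clip.device)
        noisy = (1.0 - t).view(-1, 1, 1) * clip + t.view(-1, 1, 1) * noise
        pred = denoiser(noisy, t, dir_tokens(reverse))  # predicts the noise
        total = total + F.mse_loss(pred, noise)
    return total
```

Because both directions share one set of weights, gradients from the time-reversed sample act as a regularizer on forward generation; at inference only the forward token is used, which is why the method adds no runtime cost over its backbone.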

Figures (5)

  • Figure 1: Cycle-consistency of Time. Given identical start/end frames, we test temporal symmetry by generating (top) a forward sequence and (bottom) its time-reversed counterpart via swapped endpoints. The baseline fails to synthesize true backward motion and instead resolves the constraint via a directional flip, where the dog re-orients to walk forward. In contrast, our model achieves robust cycle-consistency. It captures authentic reverse dynamics, producing a coherent backward-walking video that preserves the subject's orientation, thereby closing the temporal loop faithfully.
  • Figure 2: A brief overview of our framework. During training, each ground-truth video is used to construct two samples. The forward sample interpolates from the original start frame to the original end frame. The backward sample interpolates in the reverse temporal direction, starting from the original end frame and ending at the original start frame. These two directions are controlled by distinct learnable directional tokens. The model is supervised with reconstruction losses in both latent space and pixel space for both directions, which encourages consistent motion modeling under time reversal (a data-construction sketch follows this figure list).
  • Figure 3: Qualitative Comparisons with Baselines. Our method, applied to two backbones (Wan+Ours and FP+Ours), achieves significantly smoother trajectories and more coherent temporal dynamics on both short (37-frame) and long (73-frame) videos. Videos can be viewed in our supplementary material.
  • Figure 4: Efficiency vs. Performance. Models closer to the top-left corner exhibit faster inference and higher quality; circle area indicates parameter count. Our method is complementary to existing video interpolation models, outperforming baselines on VBench score and achieving higher average quality across the evaluated metrics, while maintaining the same inference time as its backbones.
  • Figure 5: Qualitative ablation study of key components in our framework. Our full model generates a fluid motion sequence that naturally evolves from a "sliding" preparation into a full "jump", ensuring high temporal coherence and dynamic realism. Videos can be viewed in our supplementary material.
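
The sample construction of Figure 2, combined with the short-to-long curriculum from the abstract, can be pictured with a short sketch. The step thresholds, window sampling, and dictionary layout below are our assumptions; only the forward/backward pair construction and the 37- and 73-frame lengths come from the text.

```python
import random
import torch

def curriculum_length(step: int,
                      schedule=((0, 9), (10_000, 37), (30_000, 73))) -> int:
    # Short-to-long curriculum: return the training clip length for the
    # current optimizer step. The thresholds here are illustrative; only
    # the 37- and 73-frame targets come from the paper's evaluation.
    length = schedule[0][1]
    for start_step, clip_len in schedule:
        if step >= start_step:
            length = clip_len
    return length

def make_training_pair(video: torch.Tensor, step: int):
    """Build the two samples of Figure 2 from one ground-truth video.

    video: (T, C, H, W). The forward sample interpolates from the
    window's first frame to its last; the backward sample swaps the
    endpoints and reverses the target frames.
    """
    t = curriculum_length(step)
    start = random.randint(0, video.size(0) - t)
    clip = video[start:start + t]
    fwd = {"start": clip[0], "end": clip[-1], "target": clip, "reverse": False}
    rev = torch.flip(clip, dims=[0])
    bwd = {"start": rev[0], "end": rev[-1], "target": rev, "reverse": True}
    return fwd, bwd
```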