Arbitrary Generative Video Interpolation

Guozhen Zhang, Haiguang Wang, Chunyu Wang, Yuan Zhou, Qinglin Lu, Limin Wang

TL;DR

ArbInterp addresses the rigidity of traditional video frame interpolation by enabling generation at arbitrary timestamps and any output length. It introduces timestamp-aware RoPE (TaRoPE) to encode continuous temporal positions, and an appearance-motion decoupled conditioning strategy to ensure cross-segment coherence in long sequences. Empirical results on MultiInterpBench show superior fidelity and seamless spatiotemporal continuity across 2x–32x interpolations, outperforming state-of-the-art methods and demonstrating practical flexibility for real-world applications. The approach offers a scalable, efficient paradigm for generative VFI with continuous dynamics, paving the way for further enhancements such as text-guided control and larger-scale models.

Abstract

Video frame interpolation (VFI), which generates intermediate frames from given start and end frames, has become a fundamental function in video generation applications. However, existing generative VFI methods are constrained to synthesize a fixed number of intermediate frames, lacking the flexibility to adjust generated frame rates or total sequence duration. In this work, we present ArbInterp, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of fixed-position paradigms in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis. We further design a novel appearance-motion decoupled conditioning strategy: it leverages prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we build comprehensive benchmarks for multi-scale frame interpolation (2x to 32x) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods across all scenarios with higher fidelity and more seamless spatiotemporal continuity. Project website: https://mcg-nju.github.io/ArbInterp-Web/.
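
The abstract describes TaRoPE as modulating the temporal RoPE positions so that generated frames align with target normalized timestamps. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the modulation is a linear map from a timestamp in [0, 1] to a continuous position in [0, span], and the names `timestamps_to_positions`, `rope_rotate`, and `span` are hypothetical. The rotation itself is the standard rotary embedding, which is well defined for fractional positions.

```python
import math

def timestamps_to_positions(timestamps, span):
    """Map normalized timestamps in [0, 1] to continuous temporal positions.

    Assumption (not from the paper): a linear map t -> t * span, where `span`
    is a hypothetical nominal temporal length of one interpolation segment.
    """
    for t in timestamps:
        if not 0.0 <= t <= 1.0:
            raise ValueError(f"timestamp {t} is outside [0, 1]")
    return [t * span for t in timestamps]

def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary embedding to one even-length vector.

    `position` may be fractional; each consecutive channel pair (i, i + 1)
    is rotated by position * base ** (-i / dim).
    """
    dim = len(vec)
    if dim % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

# Uniform timestamps reproduce ordinary integer positions ...
uniform = timestamps_to_positions([0.0, 0.25, 0.5, 0.75, 1.0], span=4)
# ... while arbitrary timestamps simply yield fractional positions.
arbitrary = timestamps_to_positions([0.0, 0.1, 0.9, 1.0], span=4)
rotated = [rope_rotate([1.0, 0.0, 0.0, 1.0], p) for p in arbitrary]
```

Under this assumption, evenly spaced timestamps fall back to the usual integer frame indices, so a fixed-position scheme is a special case of the continuous one. Because rotary attention scores depend only on position differences, frames placed at fractional positions keep a consistent notion of temporal distance.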

Paper Structure

This paper contains 42 sections, 7 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: A comparison between the fixed interpolation paradigm (a) and our proposed ArbInterp (b). ArbInterp enables flexible control of the temporal positions of generated intermediate frames by specifying any timestamps between 0 and 1.
  • Figure 2: Overall architecture of ArbInterp. Our framework enables arbitrary-length interpolation with continuous timestamps using Timestamp-aware Rotary Position Embedding (TaRoPE). Additionally, we introduce an appearance-motion decoupled conditioning strategy to improve long-term interpolation. This strategy ensures appearance consistency via prefix frame guidance and enforces motion continuity through motion tokens.
  • Figure 3: Comparison of interpolation strategies in ArbInterp. ArbInterp supports multiple interpolation strategies: (a) Direct Interpolation for short-range interpolation, and two for long-term scenarios: (b) Segment-by-Segment Interpolation and (c) Hierarchical Interpolation.
  • Figure 4: Comparison of different conditioning strategies: (a) direct latent conditioning, (b) cross-attention conditioning, and (c) our proposed appearance-motion decoupling conditioning strategy.
  • Figure 5: Visual comparison. The timestamps of the intermediate frames are 0.25, 0.5, and 0.75, respectively. ArbInterp demonstrates significant advantages in stability and consistency.
  • ...and 5 more figures
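
The Figure 3 caption names two long-range strategies, Segment-by-Segment and Hierarchical Interpolation, without giving their schedules here. The sketch below shows one plausible way to order the target timestamps under each strategy. It is an illustration under stated assumptions, not the paper's procedure: the function names, the `frames_per_segment` and `branch` parameters, and the coarse-to-fine refinement rule are all hypothetical.

```python
from fractions import Fraction

def intermediate_timestamps(factor):
    """Normalized timestamps of the factor - 1 intermediate frames for factor-x interpolation."""
    return [Fraction(k, factor) for k in range(1, factor)]

def segment_by_segment(factor, frames_per_segment):
    """Split the intermediate timestamps into consecutive chunks, in temporal order.

    Assumption: each chunk is generated in one pass, after the chunk before it.
    """
    ts = intermediate_timestamps(factor)
    return [ts[i:i + frames_per_segment] for i in range(0, len(ts), frames_per_segment)]

def hierarchical(factor, branch=2):
    """Coarse-to-fine schedule: each level refines the temporal grid by `branch`.

    Assumption: `factor` is a power of `branch`; a level only generates the
    timestamps that earlier levels have not produced yet.
    """
    levels, known, step = [], {Fraction(0), Fraction(1)}, 1
    while step < factor:
        step *= branch
        if factor % step:
            raise ValueError("factor must be a power of branch")
        new = [Fraction(k, step) for k in range(1, step) if Fraction(k, step) not in known]
        levels.append(new)
        known.update(new)
    return levels

segments = segment_by_segment(8, frames_per_segment=3)  # chunks of 3, 3 and 1 timestamps
levels = hierarchical(8)                                # levels of 1, 2 and 4 timestamps
```

Both schedules cover the same set of timestamps and differ only in generation order: the first proceeds left to right, the second places sparse frames first and fills the gaps afterwards.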