STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows

Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai

TL;DR

STARFlow-V tackles autoregressive video generation with end-to-end likelihood-based modeling by pairing a global–local autoregressive normalizing flow with a lightweight, causally constrained denoiser and a Jacobi-based sampling strategy. The method supports text-to-video, image-to-video, and video-to-video generation with a single backbone, achieving competitive visual fidelity and temporal coherence relative to diffusion baselines while offering native likelihood estimation. The key contributions are the global–local factorization, flow-score matching for denoising, and block-wise Jacobi inference that speeds up sampling by parallelizing the inner updates. This work positions normalizing flows as a viable world-model backbone for video, with potential impact on controllable video synthesis, editing, and simulation.

Abstract

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
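The two properties the abstract leans on, exact likelihood and Jacobi-style parallel sampling, can be illustrated on a toy affine autoregressive flow. The sketch below is not the STARFlow-V implementation: the conditioner, the weights `w_mu` and `w_s`, and all function names are hypothetical stand-ins chosen for illustration, and it uses plain Jacobi iteration rather than the block-wise, video-aware variant described in the paper. It assumes only that position `t` is transformed by a shift and scale computed from positions before `t`. Under that assumption, the data-to-noise direction is a single parallel pass with a triangular Jacobian, while the noise-to-data direction is a fixed-point problem that can be solved either one position at a time or by updating all positions at once until nothing changes.

```python
import numpy as np

def conditioner(x, w_mu, w_s):
    """Toy strictly causal conditioner (hypothetical): row t depends only on x[:t]."""
    T = x.shape[0]
    mask = np.tril(np.ones((T, T)), k=-1)           # strictly lower-triangular
    counts = np.maximum(np.arange(T), 1)[:, None]   # number of previous positions
    ctx = (mask @ x) / counts                       # running mean of x[:t]; zeros at t = 0
    mu = np.tanh(ctx @ w_mu)
    log_scale = 0.5 * np.tanh(ctx @ w_s)            # bounded for numerical stability
    return mu, log_scale

def forward(x, w_mu, w_s):
    """Data -> noise in one parallel pass, plus the exact log-likelihood."""
    mu, log_scale = conditioner(x, w_mu, w_s)
    z = (x - mu) * np.exp(-log_scale)
    log_det = -log_scale.sum()                      # log |det dz/dx| of a triangular map
    log_prior = -0.5 * (z ** 2).sum() - 0.5 * z.size * np.log(2.0 * np.pi)
    return z, log_prior + log_det

def inverse_sequential(z, w_mu, w_s):
    """Noise -> data, one position at a time (T conditioner calls)."""
    x = np.zeros_like(z)
    for t in range(z.shape[0]):
        mu, log_scale = conditioner(x, w_mu, w_s)   # row t only reads x[:t]
        x[t] = z[t] * np.exp(log_scale[t]) + mu[t]
    return x

def inverse_jacobi(z, w_mu, w_s, tol=1e-10, max_iters=None):
    """Noise -> data by updating every position in parallel until a fixed point.

    Because the dependence is strictly causal, the first k positions are exact
    after k iterations, so at most T iterations are ever needed.
    """
    T = z.shape[0]
    max_iters = T if max_iters is None else max_iters
    x = z.copy()                                    # arbitrary initial guess
    for k in range(1, max_iters + 1):
        mu, log_scale = conditioner(x, w_mu, w_s)
        x_new = z * np.exp(log_scale) + mu          # all positions updated at once
        delta = np.max(np.abs(x_new - x))
        x = x_new
        if delta < tol:                             # early exit once updates stall
            return x, k
    return x, max_iters

# Small demonstration with random, untrained weights (illustrative values only).
rng = np.random.default_rng(0)
T, D = 8, 4
w_mu = 0.3 * rng.standard_normal((D, D))
w_s = 0.3 * rng.standard_normal((D, D))
z = rng.standard_normal((T, D))

x_seq = inverse_sequential(z, w_mu, w_s)
x_jac, iters = inverse_jacobi(z, w_mu, w_s)
z_back, log_lik = forward(x_seq, w_mu, w_s)
print(f"Jacobi iterations: {iters} (sequential steps: {T})")
print(f"max |x_jacobi - x_sequential| = {np.max(np.abs(x_jac - x_seq)):.2e}")
print(f"log-likelihood of the sample: {log_lik:.3f}")
```

Each Jacobi iteration costs one parallel conditioner pass, so the scheme pays off when it converges in fewer iterations than there are positions. Restricting the iteration to blocks of positions, as the block-wise variant does, trades off parallelism within a block against the number of iterations needed.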

Paper Structure

This paper contains 36 sections, 12 equations, 9 figures, 3 tables, and 3 algorithms.

Figures (9)

  • Figure 1: Samples from STARFlow-V across three tasks. All videos are 480p at 16 fps. Red boxes mark the conditioning inputs. The same autoregressive architecture is used for all tasks with no task-specific modifications. More generated videos and comparisons are available with the released code at https://github.com/apple/ml-starflow.
  • Figure 2: Illustrated pipeline of STARFlow-V, showing (1) the proposed global-local architecture and (2) joint training of the learnable denoiser with the proposed flow-score matching. During sampling, STARFlow-V takes the encoded text condition ${\bm{t}}$ and transforms the noise ${\bm{z}}$ through a deep global block into intermediate features ${\bm{u}}$, followed by several shallow local blocks that produce a slightly noisy video. Finally, a learnable causal denoiser refines this output into the final clean video ${\bm{x}}$.
  • Figure 3: STARFlow-V compared against baselines on autoregressive generation, for both the trained length (5s) and long-horizon generation (30s). More video comparisons are available on the project page.
  • Figure 4: Comparison between speed and block size in block-wise Jacobi iteration.
  • Figure 5: Ablation study for the choice of denoiser. We compare video VAE reconstruction quality across denoising approaches over $1,000$ random videos with large motions.
  • ...and 4 more figures