Table of Contents
Fetching ...

Uniform Discrete Diffusion with Metric Path for Video Generation

Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Wenxuan Wang, Chunhua Shen, Shiguang Shan, Zhaoxiang Zhang, Xinlong Wang

TL;DR

This paper tackles the gap between continuous diffusion and discrete token-based video generation by proposing URSA, a Uniform discrete Diffusion with a Metric Path. URSA performs global, token-level refinement along a learned metric-guided probability path, augmented by a Linearized Metric Path and a Resolution-dependent Timestep Shifting strategy to handle long sequences, plus asynchronous per-frame timesteps for multi-task training. The approach achieves state-of-the-art or competitive results on discrete baselines across text-to-video, image-to-video, and text-to-image tasks, while approaching the performance of continuous diffusion models and enabling scalable, multi-task generation with fewer inference steps. The work offers a practical pathway to high-quality, versatile video generation with discrete tokens and provides extensive ablations and datasets to support its claims.

Abstract

Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA

Uniform Discrete Diffusion with Metric Path for Video Generation

TL;DR

This paper tackles the gap between continuous diffusion and discrete token-based video generation by proposing URSA, a Uniform discrete Diffusion with a Metric Path. URSA performs global, token-level refinement along a learned metric-guided probability path, augmented by a Linearized Metric Path and a Resolution-dependent Timestep Shifting strategy to handle long sequences, plus asynchronous per-frame timesteps for multi-task training. The approach achieves state-of-the-art or competitive results on discrete baselines across text-to-video, image-to-video, and text-to-image tasks, while approaching the performance of continuous diffusion models and enabling scalable, multi-task generation with fewer inference steps. The work offers a practical pathway to high-quality, versatile video generation with discrete tokens and provides extensive ablations and datasets to support its claims.

Abstract

Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA

Paper Structure

This paper contains 18 sections, 6 equations, 10 figures, 3 tables, 2 algorithms.

Figures (10)

  • Figure 1: Visualization of URSA across diverse video generation tasks: text-to-video generation, video interpolation, and long video generation. These examples underscore the versatility of URSA.
  • Figure 2: Illustration of different image/video generation paradigms. Discrete-space approaches such as AR and MDM adopt non-refinable local generation, where produced tokens are fixed once generated. In contrast, URSA introduces iterative global refinement, conceptually aligning discrete methods with continuous-space approaches, and substantially narrowing their performance gap.
  • Figure 3: Global refinement via token distance in embedding space. Starting from categorical noise $x_0$ (left), our framework refines data based on token distance to get target data $x_1$ (right), enabling hierarchical structure generation from global semantics to fine details.
  • Figure 4: Sampling performance across inference steps. Using the Cosmos tokenizer, we evaluate the image samples at 256$\times$256 ($\sim$1K tokens) and the video samples at 25$\times$384$\times$240 ($\sim$10K tokens).
  • Figure 5: Sampling performance of different paths. We evaluate the image samples at 256$\times$256.
  • ...and 5 more figures