Playing with Transformer at 30+ FPS via Next-Frame Diffusion

Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, Jiang Bian

TL;DR

The paper tackles real-time, high-fidelity, action-conditioned video generation by combining diffusion models with autoregressive (frame-by-frame) generation. It introduces Next-Frame Diffusion (NFD), a diffusion Transformer with block-wise causal attention: all tokens of a frame are generated in parallel, while each frame is conditioned on the frames before it. To reach interactive speeds, it uses video-domain consistency distillation and speculative sampling, plus noise injection to curb error accumulation. On a large-scale Minecraft gameplay benchmark, NFD surpasses prior autoregressive baselines in both visual quality and sampling efficiency, exceeding 30 FPS on an A100 GPU with a 310M-parameter model. These advances make controllable video generation practical for interactive and streaming applications, while leaving scalability and dataset-domain coverage as considerations for future work.

Abstract

Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications of arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost of diffusion sampling and the hardware inefficiencies inherent to autoregressive generation. To address this, we introduce two innovations: (1) we extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) to fully leverage parallel computation, motivated by the observation that adjacent frames often share the same action input, we propose speculative sampling. In this approach, the model generates the next few frames using the current action input and discards the speculatively generated frames if the input action differs. Experiments on a large-scale action-conditioned video generation benchmark demonstrate that NFD beats autoregressive baselines in terms of both visual quality and sampling efficiency. For the first time, we achieve autoregressive video generation at over 30 Frames Per Second (FPS) on an A100 GPU using a 310M model.
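
The speculative sampling idea in the abstract can be illustrated with a toy sketch. Everything named here is hypothetical and not taken from the paper: `speculative_rollout`, `toy_generate`, the `lookahead` parameter, and the use of strings as stand-ins for frames. The sketch also receives the whole action sequence up front for simplicity; in an interactive setting the comparison would presumably happen as each real input arrives.

```python
from typing import Callable, List, Sequence, Tuple

Frame = str   # stand-in for a generated frame (hypothetical)
Action = str  # stand-in for a user input (hypothetical)


def speculative_rollout(
    generate: Callable[[List[Frame], List[Action]], List[Frame]],
    history: List[Frame],
    actions: Sequence[Action],
    lookahead: int = 3,
) -> Tuple[List[Frame], int]:
    """Return the accepted frames and the number of generator calls."""
    frames = list(history)
    calls = 0
    t = 0
    while t < len(actions):
        guess = actions[t]
        # One batched call drafts `lookahead` frames, assuming the action repeats.
        draft = generate(frames, [guess] * lookahead)
        calls += 1
        # Accept drafted frames while the real input still matches the guess;
        # the first one always matches, so the loop makes progress.
        accepted = 0
        while (
            accepted < lookahead
            and t + accepted < len(actions)
            and actions[t + accepted] == guess
        ):
            accepted += 1
        # Drafted frames beyond `accepted` are discarded.
        frames.extend(draft[:accepted])
        t += accepted
    return frames[len(history):], calls


def toy_generate(context: List[Frame], assumed: List[Action]) -> List[Frame]:
    """Toy generator: labels each frame with its index and assumed action."""
    out: List[Frame] = []
    for action in assumed:
        out.append(f"f{len(context) + len(out)}:{action}")
    return out


if __name__ == "__main__":
    inputs = ["fwd", "fwd", "fwd", "jump", "fwd", "fwd"]
    result, n_calls = speculative_rollout(toy_generate, ["f0"], inputs)
    print(result, n_calls)
```

In this toy run, six frames are produced with three generator calls instead of six, because repeated actions let drafted frames be accepted; when the action changes, the remaining drafts are thrown away and generation restarts from the last accepted frame.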

Paper Structure

This paper contains 48 sections, 10 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: We present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that employs block-wise causal attention. This design enables parallel generation of multiple tokens for an entire frame, thereby enhancing sampling efficiency and better aligning with hardware constraints.
  • Figure 2: Qualitative results of the generated videos. Each row depicts a sequence of frames generated in response to a specific action command, such as sprint, attack, jump, and camera shift.
  • Figure 3: Frames generated by NFD+ and MineWorld, illustrating the superior temporal consistency achieved by NFD+. Despite a significant camera movement, NFD+ preserves a coherent and artifact-free background, whereas MineWorld introduces noticeable background distortions. This highlights NFD+’s robustness in maintaining scene integrity and temporal consistency.
  • Figure 4: Frames generated by NFD+ and MineWorld for a door-opening sequence. NFD+ renders the doors opening widely with no visible distortions, maintaining structural coherence. In contrast, MineWorld introduces a spurious artifact (a distorted line appearing between the two doors), highlighting its struggle with fine-grained object interactions.
  • Figure 5: In this case, both models have previously encountered the brown block. NFD+ successfully reconstructs the block with high fidelity, while MineWorld fails to do so. This highlights the effectiveness of NFD+'s memorization capability in preserving object identity over time.
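
The block-wise causal attention described in Figure 1 can be pictured as a mask over a flattened token sequence. The following is a minimal sketch, assuming each frame contributes a fixed number of tokens; the function name `block_causal_mask` and its parameters are illustrative and not taken from the paper. Tokens attend to every token in their own frame and in earlier frames, and to nothing in later frames.

```python
from typing import List


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    n = num_frames * tokens_per_frame
    # Frame index of each position in the flattened token sequence.
    frame_of = [pos // tokens_per_frame for pos in range(n)]
    # Allowed when the key's frame is the same as or earlier than the query's.
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


if __name__ == "__main__":
    # Print the mask for 3 frames of 2 tokens: "x" = attend, "." = blocked.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Unlike a token-level causal mask, the mask is full within each frame's block, which is what allows all tokens of a frame to be denoised together rather than one at a time.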