
Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention

Taiye Chen, Zihan Ding, Anjian Li, Christina Zhang, Zeqi Xiao, Yisen Wang, Chi Jin

TL;DR

RAD addresses the forgetting and spatiotemporal inconsistencies in long-horizon video diffusion by integrating a memory-augmented Recurrent Neural Network into the Diffusion Transformer. It compares LSTM, Mamba2, and TTT within a frame-wise autoregressive framework and introduces a hidden-state prefetch mechanism to preserve training-time parallelism. The results on Memory Maze and Minecraft show that frame-wise autoregression with memory overlap substantially improves consistency and fidelity, with LSTM typically offering the best long-range performance. Overall, the approach provides a scalable path to high-quality, long-term video generation by blending global memory with local attention in diffusion models.

Abstract

Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, which usually rely on full attention within a local window, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating LSTM with attention achieves performance comparable to state-of-the-art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to a training-inference gap or a lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval, consistently across training and inference time. Experiments on the Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.
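
To make the idea of frame-wise memory update under a fixed memory budget concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation: frames are single floats, `update_memory` is an exponential moving average standing in for an LSTM, Mamba2, or TTT block, and `predict_next` is a hypothetical stand-in for the diffusion denoiser. All function names and parameters are assumptions made for illustration.

```python
from collections import deque


def update_memory(memory, frame, decay=0.9):
    """Toy recurrent update (stand-in for an RNN block).

    The memory is a single float, so its size never grows with video length.
    """
    return decay * memory + (1.0 - decay) * frame


def predict_next(window, memory):
    """Hypothetical stand-in for the denoiser.

    Averages the frames in the local window (playing the role of local
    attention) and mixes in the global memory.
    """
    local = sum(window) / len(window)
    return 0.5 * local + 0.5 * memory


def rollout(context, num_new, window_size=3):
    """Generate num_new frames one at a time, updating memory after each frame."""
    memory = 0.0
    window = deque(maxlen=window_size)  # frames visible to local attention
    for frame in context:  # absorb the initial context frames
        memory = update_memory(memory, frame)
        window.append(frame)

    frames = list(context)
    for _ in range(num_new):
        nxt = predict_next(window, memory)
        memory = update_memory(memory, nxt)  # frame-wise memory update
        window.append(nxt)  # the oldest frame drops out of the window
        frames.append(nxt)
    return frames, memory


frames, memory = rollout(context=[1.0, 1.0], num_new=5)
```

In this sketch, frames that have left the window can still influence later predictions only through `memory`, which is the role the abstract assigns to the RNN.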

Paper Structure

This paper contains 42 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Training paradigm for Recurrent Autoregressive Diffusion with global memory and local attention. Each DiT block has three components: spatial attention, temporal attention, and an RNN memory block. The model supports both chunk-wise and frame-wise autoregressive generation with different attention mechanisms. For efficient training of the frame-wise RNN, we (1) prefetch the hidden states from the clean sample sequence to enable parallel attention computation across entire long sequences, and (2) run the diffusion model forward pass in the standard manner to obtain the diffusion loss. This improves efficiency and fidelity for large-scale, long-sequence video modeling. $h^i_j$ is the $i$-th layer hidden state for frame index $j$.
  • Figure 2: Recurrent Autoregressive Diffusion model architecture
  • Figure 3: Comparison of effective temporal attention maps for chunk-wise and frame-wise RNNs, with chunk size 3 and 2 initial context frames. The horizontal axis of the graph represents the frame index, while the vertical axis represents the index of the currently predicted frame.
  • Figure 4: Top view of Memory Maze data
  • Figure 5: Visualization results on Minecraft dataset
  • ...and 2 more figures
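
The two training steps described in the Figure 1 caption can be sketched in a toy, single-layer form. This is a hedged illustration under strong assumptions, not the paper's code: frames are floats, `update_memory` is an exponential moving average standing in for the RNN block, `denoise` is a hypothetical denoiser, and the loss is a plain mean squared error against the clean frame rather than the paper's actual diffusion objective.

```python
def update_memory(h, frame, decay=0.9):
    """Toy recurrent update standing in for the RNN memory block."""
    return decay * h + (1.0 - decay) * frame


def prefetch_hidden_states(clean_frames, h0=0.0):
    """Step (1): one sequential pass over the clean frames.

    Returns a list whose j-th entry is the memory available *before*
    frame j, i.e. a summary of clean frames 0..j-1.
    """
    states = []
    h = h0
    for frame in clean_frames:
        states.append(h)
        h = update_memory(h, frame)
    return states


def denoise(noisy_frame, h_prev):
    """Hypothetical denoiser: mixes the noisy frame with the prefetched memory."""
    return 0.5 * noisy_frame + 0.5 * h_prev


def diffusion_loss(clean_frames, noise, sigma=0.5):
    """Step (2): standard forward pass using the prefetched states."""
    states = prefetch_hidden_states(clean_frames)
    noisy = [x + sigma * e for x, e in zip(clean_frames, noise)]
    # Each prediction depends only on its own noisy frame and its prefetched
    # state, so these per-frame computations have no sequential dependency
    # and could be batched in a real model.
    preds = [denoise(x_t, h) for x_t, h in zip(noisy, states)]
    return sum((p - x) ** 2 for p, x in zip(preds, clean_frames)) / len(clean_frames)


states = prefetch_hidden_states([1.0, 2.0, 3.0])
loss = diffusion_loss([1.0, 2.0, 3.0], noise=[0.1, -0.2, 0.3])
```

Because the states come from clean frames, each frame is conditioned on the same kind of memory it would see during frame-wise inference, while the denoising step itself no longer has to run frame by frame.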