Learning Plug-and-play Memory for Guiding Video Diffusion Models

Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo, Lianhui Qin, Biwei Huang

TL;DR

The paper addresses the physical and commonsense failures of diffusion-transformer video generation by introducing DiT-Mem, a plug-and-play memory encoder that injects external world knowledge without retraining the backbone. Using training-free interventions and a dedicated memory encoder built from 3D CNNs and frequency-aware filters, it shows that separating appearance (low-frequency) from dynamics (high-frequency) enables targeted guidance during generation. Only the memory encoder is trained, on a small dataset (≈150M parameters, ≈10K samples), and it is compatible with DiT-based models, yielding state-of-the-art performance on physical commonsense benchmarks and solid gains in visual fidelity. This work offers a practical, scalable path to physics-consistent video synthesis by using retrieved reference videos and frequency-aware memory tokens that plug into frozen diffusion backbones. The approach is relevant to more reliable and controllable video generation in real-world applications where physical realism matters.

Abstract

Diffusion Transformer (DiT)-based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies showing that a DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder, DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen and optimize only the memory encoder. This yields an efficient training process with few trainable parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
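
The low-pass/high-pass split mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a centered moving average along the temporal axis as the low-pass filter and takes the residual as the high-pass component. The names `low_pass`, `split_frequencies`, and `window` are hypothetical, and the paper's filters act on learned embeddings rather than on a scalar sequence.

```python
def low_pass(seq, window=3):
    """Centered moving average over a 1D sequence.

    Assumption: a moving average stands in for the paper's low-pass filter.
    Near the edges the window shrinks to the samples that exist.
    """
    half = window // 2
    out = []
    for t in range(len(seq)):
        lo, hi = max(0, t - half), min(len(seq), t + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out


def split_frequencies(seq, window=3):
    """Return (low, high) so that low[t] + high[t] reconstructs seq[t]."""
    low = low_pass(seq, window)
    high = [x - l for x, l in zip(seq, low)]  # residual = high-pass part
    return low, high


# A feature that never changes over time has no high-frequency content,
# while a rapidly alternating feature keeps most of its energy there.
static_feature = [2.0] * 6
moving_feature = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

for name, feature in [("static", static_feature), ("moving", moving_feature)]:
    low, high = split_frequencies(feature)
    print(name, "low:", low, "high:", high)
```

In this toy setting the slowly varying part plays the role of appearance and the residual plays the role of dynamics, which mirrors the disentanglement the abstract describes only at a conceptual level.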

Paper Structure

This paper contains 41 sections, 9 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Qualitative results of frequency-based steering interventions on Wan2.1 I2V-14B. All three cases are generated from the same prompt: "A yellow mug is held by a grabber tool in front of a white projection screen with a concrete brick positioned beneath it. The grabber releases the mug. Static shot with no camera movement." Case (b) shows that high-pass features guide physical dynamics. Case (c) shows that low-pass features encode structural and object-level information. Blue and red fonts denote positive and negative findings.
  • Figure 2: Overview of our DiT-Mem framework. Given a text prompt, we retrieve the top-$k$ relevant videos from a large external memory, encode them with the model's VAE to obtain video latents, and feed these latents into a memory encoder (3D CNNs, LPF/HPF filtering, and shared self-attention) to produce compact memory tokens. During diffusion sampling, the memory tokens are concatenated with the video tokens before each self-attention layer of the frozen DiT backbone and participate in standard multi-head self-attention as queries, keys, and values. We keep only the updated video tokens while reusing the same memory tokens at every layer, providing plug-and-play memory guidance for video generation.
  • Figure 3: Visual ablation study. All variants are generated using the same text prompt ("a person is pouring water into a teacup"), the same seed, and an identical set of five retrieved reference videos. The baseline with 3D convolution layers (+3D) suffers from severe artifacts, appearing as a hallucinated second floating cup. Adding the high-pass filter (+HPF) resolves this structural issue and improves motion, but results in blurred details on the person and hand. While incorporating the low-pass filter (+LPF) introduces appearance features from other objects, our full model with Shared Attention (SA) achieves the best balance. It effectively enhances motion without over-injecting retrieved object appearances, thereby preserving the semantic fidelity of the original text prompt.
  • Figure 4: Effect of memory size on PhyGenBench performance using DiT-Mem-1.3B. Larger memory banks provide richer reference knowledge and yield consistently higher scores, while the model retains strong robustness even with significantly reduced memory capacity.
  • Figure 5: Qualitative comparison between the baseline models (Wan2.1/Wan2.2) and our method. While the baselines occasionally exhibit physical hallucinations—such as the "liquid-like" splashing of a solid basketball (top-left) or missing shadows (bottom-right)—our method correctly models rigid body dynamics, fluid interactions, and lighting geometry, resulting in superior realism.
  • ...and 1 more figure
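
The memory injection described in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: it uses single-head attention with identity query/key/value projections, omits residual connections, normalization, and MLP blocks, and all function names (`self_attention`, `layer_with_memory`, `run_backbone`) are hypothetical. It shows only the token bookkeeping: memory tokens are concatenated with video tokens before attention, only the updated video tokens are kept, and the same memory tokens are reused at every layer.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption: a real DiT layer uses learned multi-head projections.
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * value[d] for w, value in zip(weights, tokens))
                    for d in range(dim)])
    return out


def layer_with_memory(video_tokens, memory_tokens):
    """Attend over [memory; video] jointly, then keep only the video tokens."""
    joint = memory_tokens + video_tokens
    updated = self_attention(joint)
    return updated[len(memory_tokens):]  # drop the updated memory tokens


def run_backbone(video_tokens, memory_tokens, num_layers=2):
    """Reuse the same, unmodified memory tokens at every layer."""
    for _ in range(num_layers):
        video_tokens = layer_with_memory(video_tokens, memory_tokens)
    return video_tokens


video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory = [[0.5, 0.5]]
print(run_backbone(video, memory))
```

Because the memory tokens enter only through concatenation, passing an empty memory list reduces each layer to ordinary self-attention, which is one way to read the plug-and-play property: the backbone's own computation is unchanged when no memory is supplied.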