Learning Plug-and-play Memory for Guiding Video Diffusion Models
Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo, Lianhui Qin, Biwei Huang
TL;DR
The paper addresses the physics and commonsense failures of diffusion-transformer (DiT) video generation by introducing DiT-Mem, a plug-and-play memory encoder that injects external world knowledge without retraining the backbone. Training-free interventions on the model's hidden states show that low-pass and high-pass filtering separates appearance (low-frequency) from dynamics (high-frequency), which enables targeted guidance during generation. Building on this, the memory encoder combines 3D CNNs with frequency-aware filters and converts retrieved reference videos into memory tokens that plug into a frozen diffusion backbone. The encoder is trained end-to-end on a small dataset (≈150M parameters, ≈10K samples), so training is efficient and the module is broadly compatible with DiT-based models. The method achieves state-of-the-art performance on physical commonsense benchmarks and solid gains in visual fidelity, offering a practical path to more reliable and controllable video generation in applications where physical realism matters.
Abstract
Diffusion Transformer (DiT)-based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies showing that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance from high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose DiT-Mem, a learnable memory encoder composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen and optimize only the memory encoder. This makes training efficient, requiring only 150M trainable parameters and 10K data samples, and enables plug-and-play use at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
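
The two mechanisms named in the abstract, frequency filtering of embeddings and memory tokens concatenated inside self-attention, can be illustrated with a minimal NumPy sketch. It is not the released DiT-Mem implementation. Everything below is an assumption made for illustration: the function names, the choice to filter along the frame axis with an FFT mask, the `cutoff_ratio` parameter, the single-head attention, and the choice to let memory tokens extend only the keys and values.

```python
import numpy as np


def split_frequency_bands(features, cutoff_ratio=0.25):
    """Split features of shape (T, ...) into low- and high-frequency parts.

    Assumption: filtering is done along the first (frame) axis with a hard
    FFT mask; the paper's actual filter design may differ.
    """
    num_frames = features.shape[0]
    spectrum = np.fft.fft(features, axis=0)
    # Absolute normalized frequency of each FFT bin, in [0, 0.5].
    freqs = np.abs(np.fft.fftfreq(num_frames))
    low_mask = (freqs <= cutoff_ratio * 0.5).astype(float)
    # Reshape the mask so it broadcasts over the remaining axes.
    low_mask = low_mask.reshape(num_frames, *([1] * (features.ndim - 1)))
    low = np.fft.ifft(spectrum * low_mask, axis=0).real
    high = features - low  # the two bands sum back to the input
    return low, high


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_with_memory(tokens, memory, w_q, w_k, w_v):
    """Single-head attention over video tokens plus prepended memory tokens.

    Assumption: memory tokens contribute keys and values only, so the
    output keeps the shape of the video tokens.
    """
    context = np.concatenate([memory, tokens], axis=0)  # (M + N, D)
    q = tokens @ w_q    # (N, D)
    k = context @ w_k   # (M + N, D)
    v = context @ w_v   # (M + N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v  # (N, D)
```

In this sketch the backbone weights `w_q`, `w_k`, and `w_v` are never modified, which mirrors the frozen-backbone setting: only the module that produces `memory` would need training.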
