When to Lock Attention: Training-Free KV Control in Video Diffusion

Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang

TL;DR

KV-Lock is a training-free framework tailored for DiT-based video diffusion models that leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values and newly generated KVs, and the CFG scale.

Abstract

Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments validate that our method outperforms existing approaches, improving foreground quality while maintaining high background fidelity across various video editing tasks.
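The scheduling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the linear ramp, the saturation point at twice the threshold, the default values, and the names `hallucination_score` and `schedule` are all assumptions made for illustration. The only parts taken from the text are that the hallucination metric is a variance of denoising predictions, and that a detected hallucination risk raises both the background KV locking strength and the CFG scale.

```python
from statistics import pvariance


def hallucination_score(predictions):
    """Population variance of a set of denoising prediction values.

    How the paper aggregates predictions is not stated in the abstract;
    a plain variance over a list of floats is an assumption.
    """
    return pvariance(predictions)


def schedule(score, tau, base_lock=0.5, max_lock=1.0, base_cfg=5.0, max_cfg=9.0):
    """Map a hallucination score to (kv_lock_ratio, cfg_scale).

    Hypothetical rule: at or below the threshold tau, return the base values.
    Above tau, raise both linearly with the excess, saturating at 2 * tau.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Normalised excess over the threshold, clipped to [0, 1].
    excess = min(max((score - tau) / tau, 0.0), 1.0)
    lock_ratio = base_lock + excess * (max_lock - base_lock)
    cfg_scale = base_cfg + excess * (max_cfg - base_cfg)
    return lock_ratio, cfg_scale


if __name__ == "__main__":
    low = hallucination_score([0.9, 1.0, 1.1])   # predictions agree
    high = hallucination_score([0.0, 1.0, 2.0])  # predictions disagree
    print(schedule(low, tau=0.5))
    print(schedule(high, tau=0.5))
```

In this sketch a single scalar drives both controls, which mirrors the abstract's claim that the variance metric is linked to the CFG scale; a real scheduler could use separate curves for the two quantities.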

Paper Structure (32 sections, 19 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: We propose KV-Lock, which dynamically schedules key-value (KV) cache and classifier-free guidance (CFG) based on diffusion model hallucination detection. It enhances foreground quality while ensuring background consistency. Experiments demonstrate that our KV-Lock outperforms VACE [jiang2025vace] in both reference-based and reference-free video editing tasks.
  • Figure 2: Overview of the KV-Lock framework. The encoder first encodes the inputs. Subsequently, during the inversion process, the KV pairs of source tokens are cached. Then, in the denoising process, a hallucination-detection-based scheduler enables the dynamic fusion of newly generated KV and cached KV to ensure background consistency, while dynamic scheduling of CFG enhances foreground quality. Finally, decoding is performed by the decoder.
  • Figure 3: Two comparison samples. FLATTEN [cong2023flatten] and TokenFlow [geyer2023tokenflow] fail almost completely, with extensive distortions and artifacts. In the first sample, the fox's two eyes generated by VACE [jiang2025vace] and CFG-Zero* [fan2025cfg] are asymmetrical and unnatural, and APG [sadat2024eliminating] visibly distorts the fox, while the fur texture rendered by KV-Lock is more refined than that of ProEdit [ouyang2025proedit]. In the second sample, the distant dust generated by VACE [jiang2025vace] is excessively bright and unrealistic, and the road surface is an asphalt road with soil instead of a dirt road, a problem that also appears in CFG-Zero* [fan2025cfg] and APG [sadat2024eliminating]. In CFG-Zero* [fan2025cfg], the dust raised by the front vehicle has an obviously unnatural boundary. In APG [sadat2024eliminating], the rear vehicle raises no dust at all, and ProEdit [ouyang2025proedit] produces more prominent distant dust than near dust; both are contrary to common sense.
  • Figure 4: More samples of KV-Lock, including changing, removing and adding tasks.
  • Figure 5: In the early diffusion stage, the variances of both in-support and hallucination samples are high. After hallucination is detected, KV-Lock keeps the variance below the threshold $\tau$ through dynamic scheduling.
  • ...and 4 more figures
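
The two operations named in the Figure 2 caption, fusing cached and newly generated KV for background tokens and applying a CFG scale, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: tokens are reduced to single floats, the fusion is assumed to be a convex blend controlled by a hypothetical `lock_ratio`, and the function names `fuse_kv` and `cfg_combine` are invented here. The CFG combination itself is the standard classifier-free guidance formula.

```python
def fuse_kv(cached_kv, new_kv, is_background, lock_ratio):
    """Blend cached and new KV values token by token.

    Background tokens get lock_ratio * cached + (1 - lock_ratio) * new
    (an assumed convex blend); foreground tokens keep the new value.
    """
    if not 0.0 <= lock_ratio <= 1.0:
        raise ValueError("lock_ratio must lie in [0, 1]")
    fused = []
    for cached, new, background in zip(cached_kv, new_kv, is_background):
        if background:
            fused.append(lock_ratio * cached + (1.0 - lock_ratio) * new)
        else:
            fused.append(new)
    return fused


def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    cached = [1.0, 1.0, 1.0]        # KV cached during inversion (toy values)
    new = [3.0, 3.0, 3.0]           # KV produced during denoising (toy values)
    mask = [True, True, False]      # first two tokens are background
    print(fuse_kv(cached, new, mask, lock_ratio=0.5))
    print(cfg_combine([0.0, 1.0], [1.0, 1.0], scale=7.0))
```

With `lock_ratio=1.0` the background tokens are fully locked to the cached values, and with `lock_ratio=0.0` the cache is ignored; a scheduler driven by the hallucination metric would move between these extremes at each denoising step.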