AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation
Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang, Manyuan Zhang, Ser-Nam Lim, Harry Yang
TL;DR
Text-guided image-to-video (TI2V) methods often fail to realize the fine-grained edits described in prompts, a failure the authors call semantic negligence. AlignVid is a training-free remedy that reweights attention through Attention Scaling Modulation (ASM) and Guidance Scheduling (GS). Both are motivated by an energy-based interpretation of attention and aim to sharpen prompt semantics while preserving visual fidelity. The authors introduce the OmitI2V benchmark to quantify semantic negligence and report consistent improvements in semantic fidelity and motion dynamics across multiple TI2V architectures, with only minor aesthetic trade-offs. Additional experiments show good generalization to text-to-image, text-to-video, and image editing tasks. Because it requires no retraining, the method is a lightweight, broadly applicable tool for enforcing prompt semantics in TI2V generation.
Abstract
Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing the attention maps for the blurred input, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii) Guidance Scheduling (GS), which applies ASM selectively across transformer blocks and denoising steps to reduce visual quality degradation. This minimal intervention improves prompt adherence while limiting aesthetic degradation. In addition, we introduce OmitI2V to evaluate semantic negligence in TI2V generation, comprising 367 human-annotated samples that span addition, deletion, and modification scenarios. Extensive experiments demonstrate that AlignVid can enhance semantic fidelity.
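
As a rough illustration of the ASM and GS ideas described above, the following NumPy sketch scales the queries by a factor gamma before the softmax and applies that factor only in selected blocks and denoising steps. The function names, the value gamma = 1.5, the block indices, and the step window are assumptions made for this sketch and are not taken from the paper. It is not the authors' implementation, only a toy single-head attention that shows how Q scaling lowers the entropy of the attention distribution.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(q, k, gamma=1.0):
    # Scaled dot-product attention weights with an extra query scale gamma.
    # Scaling Q by gamma is equivalent to scaling K, or the logits, by gamma.
    d = q.shape[-1]
    logits = (gamma * q) @ k.T / np.sqrt(d)
    return softmax(logits, axis=-1)


def mean_entropy(p, eps=1e-12):
    # Mean Shannon entropy (in nats) over the rows of an attention matrix.
    return float(-(p * np.log(p + eps)).sum(axis=-1).mean())


def scheduled_gamma(block, step, gamma, blocks, step_range):
    # Hypothetical schedule: use gamma only in the chosen blocks and
    # within the half-open step window [lo, hi); otherwise leave attention as is.
    lo, hi = step_range
    return gamma if (block in blocks and lo <= step < hi) else 1.0


rng = np.random.default_rng(0)
q = rng.standard_normal((16, 32))  # 16 query tokens, head dimension 32
k = rng.standard_normal((8, 32))   # 8 key tokens (e.g. text tokens)

base = attention_weights(q, k, gamma=1.0)
sharp = attention_weights(q, k, gamma=1.5)  # assumed scale, for illustration only

print("mean entropy, unscaled:", mean_entropy(base))
print("mean entropy, scaled:  ", mean_entropy(sharp))

# Example schedule lookup: block 3 at denoising step 5 is scaled, block 7 is not.
print(scheduled_gamma(3, 5, 1.5, blocks={2, 3}, step_range=(0, 10)))
print(scheduled_gamma(7, 5, 1.5, blocks={2, 3}, step_range=(0, 10)))
```

For fixed logits, the entropy of a softmax does not increase as the logits are multiplied by a larger positive factor, so any gamma above 1 concentrates attention on the highest-scoring keys. Restricting the scale to a subset of blocks and steps, as in `scheduled_gamma`, is one simple way to limit how much of the network is perturbed.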
