AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang, Manyuan Zhang, Ser-Nam Lim, Harry Yang

TL;DR

TI2V methods often fail to realize fine-grained edits described by prompts. AlignVid offers a training-free solution by reweighting attention through Attention Scaling Modulation (ASM) and Guidance Scheduling (GS), grounded in an energy-based interpretation of attention to sharpen semantics while preserving visual fidelity. The authors introduce the OmitI2V benchmark to quantify semantic negligence and demonstrate consistent improvements in semantic fidelity and motion dynamics across multiple TI2V architectures, with only minor aesthetic trade-offs. Additional experiments show good generalization to text-to-image, text-to-video, and image editing tasks, highlighting the method's versatility and practicality without retraining. The work provides a lightweight, broadly applicable tool for enforcing prompt semantics in TI2V generation.

Abstract

Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing attention maps, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii) Guidance Scheduling (GS), which applies ASM selectively across transformer blocks and denoising steps to reduce visual quality degradation. This minimal intervention improves prompt adherence while limiting aesthetic degradation. In addition, we introduce OmitI2V to evaluate semantic negligence in TI2V generation, comprising 367 human-annotated samples that span addition, deletion, and modification scenarios. Extensive experiments demonstrate that AlignVid can enhance semantic fidelity.
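
The two components described above can be sketched together as follows. This is a minimal illustration rather than the authors' implementation: the scale `GAMMA`, the active step and block ranges, the helper names `asm_scale` and `attention`, and the choice to scale Q rather than K are all hypothetical, picked only to show how a schedule could gate the Q scaling.

```python
import numpy as np

# Hypothetical values for illustration only; none are taken from the paper.
GAMMA = 1.3                    # Q scaling factor used when ASM is active
ACTIVE_STEPS = range(0, 10)    # denoising steps where ASM is applied
ACTIVE_BLOCKS = range(4, 12)   # transformer blocks where ASM is applied

def asm_scale(step, block):
    """Guidance Scheduling sketch: return the scale for this (step, block)."""
    if step in ACTIVE_STEPS and block in ACTIVE_BLOCKS:
        return GAMMA
    return 1.0  # ASM off: attention is left unchanged

def attention(Q, K, V, scale=1.0):
    """Scaled dot-product attention with an optional scalar applied to Q."""
    logits = (scale * Q) @ K.T / np.sqrt(Q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
out_on = attention(Q, K, V, scale=asm_scale(step=2, block=6))    # ASM active
out_off = attention(Q, K, V, scale=asm_scale(step=30, block=6))  # ASM inactive
```

Outside the scheduled range the scale is 1.0, so those blocks and steps compute exactly the baseline attention.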

Paper Structure

This paper contains 55 sections, 4 theorems, 51 equations, 21 figures, 16 tables, 2 algorithms.

Key Result

Lemma 4.1

Consider scaling the query or key embeddings by a positive scalar $\gamma_t>0$. Replacing $Q_t$ by $\gamma_t Q_t$ (or $K_t$ by $\gamma_t K_t$) multiplies every attention logit by $\gamma_t$, so each row of the attention uses a softmax with temperature $\alpha_t=\gamma_t$, i.e. $p^{(i)}(\alpha_t)=\sigma(\alpha_t z^{(i)})$, where $z^{(i)}$ denotes the unscaled logits of row $i$.
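
The lemma can be checked numerically. The sketch below assumes standard scaled dot-product logits $z = QK^\top/\sqrt{d}$ (the $1/\sqrt{d}$ factor, the tensor shapes, and $\gamma = 1.5$ are illustrative assumptions, not values from the paper). It compares scaling Q, scaling K, and scaling the logits directly, and also computes per-row entropy before and after scaling.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def row_entropy(p):
    # Shannon entropy of each row; the small constant guards against log(0).
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

rng = np.random.default_rng(0)
n_q, n_k, d = 4, 6, 8          # illustrative sizes
Q = rng.normal(size=(n_q, d))
K = rng.normal(size=(n_k, d))
gamma = 1.5                    # illustrative positive scale

z = Q @ K.T / np.sqrt(d)                               # unscaled logits
p_scaled_q = softmax((gamma * Q) @ K.T / np.sqrt(d))   # scale Q
p_scaled_k = softmax(Q @ (gamma * K).T / np.sqrt(d))   # scale K
p_temp = softmax(gamma * z)                            # scale logits directly

same_as_temperature = (
    np.allclose(p_scaled_q, p_temp) and np.allclose(p_scaled_k, p_temp)
)
entropy_base = row_entropy(softmax(z))
entropy_scaled = row_entropy(p_temp)
```

All three distributions coincide because the scalar factors out of the dot product. With $\gamma > 1$ each row's entropy does not increase, which matches the sharper attention the abstract associates with improved semantic adherence.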

Figures (21)

  • Figure 1: The baseline model (FramePack) exhibits semantic negligence, failing to realize the prompt-specified modifications. In (a), the sunflower mentioned in the prompt is entirely missing. In (b), the person remains static instead of climbing onto the tank as instructed.
  • Figure 2: Pilot example. (a) Videos and attention maps generated from the original input image (top) and from the same image after applying Gaussian blur (bottom). (b) Attention map visualization. For the original input, the model assigns high attention scores to the reference image, low scores to the text tokens, and weak attention across video frames. When the blurred image is used as input, attention to the image is suppressed, while attention to the text and temporal neighbors is strengthened. (c) Statistics over 30 sampled benchmark examples, comparing attention scores in different regions before and after blur (top), as well as the ratio of attention entropy. Adding blur can increase the cross-attention score while reducing entropy, indicating sharper, more focused attention.
  • Figure 3: Attention analysis. ASM sharpens attention (lower entropy), boosts focus on text tokens and adjacent frames, and suppresses static-image regions.
  • Figure 4: Statistical distributions of the OmitI2V benchmark.
  • Figure 5: Sample ("Modification" task) from the OmitI2V benchmark.
  • ...and 16 more figures

Theorems & Definitions (6)

  • Lemma 4.1: Q/K scaling as temperature control
  • Lemma 4.2: Within-block entropy monotonicity
  • Theorem C.1: Lipschitz Continuity of Attention Output
  • Proof: Detailed Proof
  • Proposition C.2: Upper Bound on State Deviation
  • Proof: Detailed Proof