VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip

Wenqi Guo, Shan Du

TL;DR

VSF addresses the persistent challenge of enforcing negative prompts in fast, few-step diffusion and flow-matching models by dynamically flipping the sign of negative-prompt attention contributions at the token level. By combining adaptive attention with careful duplication/masking of negative embeddings, VSF achieves stronger negative-content avoidance while preserving positive prompt fidelity and image quality, even in extremely fast generation regimes. Empirical results on NegGenBench and multiple baselines show VSF outperforming NASA, NAG, and CFG in negative adherence, with competitive or superior quality, and qualitative attention analyses corroborate the mechanism. The approach is practical, computationally efficient, and broadly compatible with contemporary transformer-based diffusion architectures, offering a straightforward path to safer and more controllable image and video generation.

Abstract

We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only a small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even compared to CFG in non-few-step models, while maintaining competitive image quality. Code and a ComfyUI node are available at https://github.com/weathon/VSF/tree/main.
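
The core idea in the abstract, flipping the sign of the attention values that come from negative-prompt tokens, can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' implementation (see the linked repository for that). The function name `vsf_cross_attention`, the single-head layout, and the way `alpha` and `beta` are applied are assumptions made for illustration, loosely following the Figure 3 caption (values scaled by $-\alpha$, a bias $-\beta$ on image-to-negative attention). The duplication and masking of negative tokens used in MMDiT-style models is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def vsf_cross_attention(q_img, k_pos, v_pos, k_neg, v_neg, alpha=1.0, beta=0.0):
    """Single-head cross-attention with sign-flipped negative-prompt values.

    Illustrative sketch only; names and shapes are assumptions.
    q_img:        (n_img, d) image-token queries
    k_pos, v_pos: (n_pos, d) positive-prompt keys and values
    k_neg, v_neg: (n_neg, d) negative-prompt keys and values
    alpha: flip strength; negative-prompt values are scaled by -alpha
    beta:  bias subtracted from the image-to-negative attention logits
    """
    d = q_img.shape[-1]
    n_pos = k_pos.shape[0]
    # Keys are left untouched; only the negative-prompt values change sign.
    keys = np.concatenate([k_pos, k_neg], axis=0)
    values = np.concatenate([v_pos, -alpha * v_neg], axis=0)
    logits = q_img @ keys.T / np.sqrt(d)
    # Down-weight attention to the negative tokens before the softmax.
    logits[:, n_pos:] -= beta
    weights = softmax(logits, axis=-1)
    return weights @ values


# Usage with random tensors standing in for real model activations.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
kp, vp = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
kn, vn = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
out = vsf_cross_attention(q, kp, vp, kn, vn, alpha=1.0, beta=0.2)
```

In this sketch, image tokens that attend strongly to a negative token receive that token's value with the opposite sign, so the suppression is strongest where the undesired concept would otherwise appear. Setting `alpha=-1` and `beta=0` recovers ordinary attention over the concatenated prompts.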

Paper Structure

This paper contains 39 sections, 19 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Original image without negative guidance and image generated using our VSF negative guidance on Stable Diffusion 3.5 Large Turbo. The green prompt is the positive prompt, and the red one is the negative prompt. These examples are particularly challenging because they remove essential parts of an object. The "hands" in the last image refers to clock hands.
  • Figure 2: An example of forcefully applying CFG to a step-distilled model, using a guidance scale of 2.8 and only 4 steps on SD-3.5-Large Turbo. The positive prompt describes a Canadian winter with a capybara, while the negative prompt includes the word “snow.” The resulting image merges these conflicting concepts unnaturally and exhibits severe over-saturation artifacts.
  • Figure 3: The attention mechanism of our method. We pass image tokens ($I$), positive prompt tokens ($P$), and negative prompt tokens ($N$) into attention. For keys and values, $N$ is duplicated, with the values of one copy ($N^{(1)}$) scaled by $-\alpha$. Some areas are masked to avoid interference. A bias $-\beta$ is added to the $I\rightarrow N^{(1)}$ attention.
  • Figure 4: Attention maps and intermediate images during the diffusion process. The leftmost column shows the final generated image (top) and an image generated without applying VSF scaling ($\alpha=0$, bottom). The top row on the right side displays the unnormalized attention values between image tokens and negative prompt tokens, while the bottom row shows the corresponding intermediate images at each timestep. The negative prompt is "umbrella."
  • Figure 5: (Left) Style Avoidance Tests, (Right) Abstract art generated by mentioning the main object "car" in a negative prompt.
  • ...and 14 more figures

Theorems & Definitions (1)

  • proof