Let Triggers Control: Frequency-Aware Dropout for Effective Token Control

Junyoung Koh, Hoyeon Moon, Dongha Kim, Seungmin Lee, Sanghyun Park, Min Song

Abstract

Text-to-image models such as Stable Diffusion have achieved unprecedented levels of high-fidelity visual synthesis. As these models advance, personalization of generative models -- commonly facilitated through Low-Rank Adaptation (LoRA) with a dedicated trigger token -- has become a significant area of research. Previous works have naively assumed that fine-tuning with a single trigger token suffices to represent new concepts. However, this often results in poor controllability, where the trigger token alone fails to reliably evoke the intended concept. We attribute this issue to the frequent co-occurrence of the trigger token with the surrounding context during fine-tuning, which entangles their representations and compromises the token's semantic distinctiveness. To disentangle this, we propose Frequency-Aware Dropout (FAD) -- a novel regularization technique that improves prompt controllability without adding new parameters. FAD consists of two key components: co-occurrence analysis and curriculum-inspired scheduling. Qualitative and quantitative analyses across token-based diffusion models (SD 1.5 and SDXL) and natural language-driven backbones (FLUX and Qwen-Image) demonstrate consistent gains in prompt fidelity, stylistic precision, and user-perceived quality. Our method provides a simple yet effective dropout strategy that enhances controllability and personalization in text-to-image generation. Notably, it achieves these improvements without introducing additional parameters or architectural modifications, making it readily applicable to existing models with minimal computational overhead.
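The core mechanism -- raising the dropout probability of caption tokens that frequently co-occur with the trigger, so the concept binds to the trigger rather than to its usual context -- can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear scaling by co-occurrence rate, the `p_max` cap, and the function names are all hypothetical, and the paper's curriculum-inspired scheduling is omitted.

```python
import random
from collections import Counter


def cooccurrence_counts(captions, trigger):
    """Count how often each tag appears in captions containing the trigger."""
    counts = Counter()
    for tags in captions:
        if trigger in tags:
            counts.update(t for t in tags if t != trigger)
    return counts


def frequency_aware_dropout(tags, trigger, counts, n_trigger, p_max=0.8):
    """Drop each non-trigger tag with probability proportional to its
    co-occurrence rate with the trigger; the trigger itself is always kept.
    (Linear scaling and the p_max cap are illustrative assumptions.)"""
    kept = []
    for t in tags:
        if t == trigger:
            kept.append(t)  # never drop the trigger token
            continue
        p_drop = p_max * counts.get(t, 0) / max(n_trigger, 1)
        if random.random() >= p_drop:
            kept.append(t)
    return kept


# Toy dataset: "sunny day" co-occurs with the trigger in every caption,
# so it receives the highest dropout probability during fine-tuning.
captions = [
    ["pochacco", "riding a bike", "sunny day"],
    ["pochacco", "sunny day", "flower petals"],
]
counts = cooccurrence_counts(captions, "pochacco")
augmented = frequency_aware_dropout(captions[0], "pochacco", counts, n_trigger=2)
```

Intuitively, tags that always accompany the trigger are the ones most likely to absorb the concept, so they are the ones most aggressively hidden from the text encoder.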

Paper Structure

This paper contains 26 sections, 7 equations, 14 figures, 12 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparison of images generated using Normal Dropout and Frequency-Aware Dropout with the prompt: pochacco, riding a bike, sunny day, flower petals, where pochacco serves as the trigger token.
  • Figure 2: Comparison between Normal Dropout (top) and Frequency-Aware Dropout (FAD) (bottom). Top: Normal Dropout lets style cues scatter and the trigger token is ignored. Bottom: FAD raises the dropout of tokens that often co-occur with the trigger, binding the style to the trigger token $t$.
  • Figure 3: Tag frequency for each dataset. The trigger token is indicated in red.
  • Figure 4: Anchoring results for hsng trained with anchor token japanese man. When the anchor is removed ("except"), FAD/sFAD degenerates to a generic face (often child-like), indicating tighter identity binding. The Normal baseline still shows identity leakage under the except prompt, suggesting weaker disentanglement.
  • Figure 5: Attention map results using trigger token pikachu and faker under three dropout strategies: Normal dropout (left), FAD (middle) and sFAD (right). Each image is generated with the following prompts: pikachu --- pikachu, animal focus, black eyes, closed mouth, full body, looking at viewer, no humans, simple background, smile, solo, standing, straight-on. faker --- faker, 1boy, asian, black eyes, black hair, glasses, grey background, hand on own chin, hand up, looking at viewer, male focus, photorealistic, red jacket, round eyewear, short hair, simple background, solo, upper body.
  • ...and 9 more figures