Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation

Xiaoying Xing, Avinab Saha, Junfeng He, Susan Hao, Paul Vicol, Moonkyung Ryu, Gang Li, Sahil Singla, Sarah Young, Yinxiao Li, Feng Yang, Deepak Ramachandran

TL;DR

Focus-N-Fix tackles localized quality issues in text-to-image generation by restricting fine-tuning to problematic image regions rather than globally optimizing image-level rewards. It combines diffusion-based generation with a differentiable reward and a regional constraint, updating only LoRA parameters to preserve the original pretrained model's structure, and it operates with a standard forward pass at inference. The method uses heatmaps or saliency maps to locate artifacts, over-sexualization, violence, or misalignment, achieving localized improvements across multiple quality aspects and generalizing to SDXL and other diffusion backbones. This yields safer, more faithful T2I outputs with reduced risk of forgetting and reward hacking, enabling more reliable deployment in practice.

Abstract

Text-to-image (T2I) generation has made significant advances in recent years, but challenges still remain in the generation of perceptual artifacts, misalignment with complex prompts, and safety. The prevailing approach to address these issues involves collecting human feedback on generated images, training reward models to estimate human feedback, and then fine-tuning T2I models based on the reward models to align them with human preferences. However, while existing reward fine-tuning methods can produce images with higher rewards, they may change model behavior in unexpected ways. For example, fine-tuning for one quality aspect (e.g., safety) may degrade other aspects (e.g., prompt alignment), or may lead to reward hacking (e.g., finding a way to increase rewards without having the intended effect). In this paper, we propose Focus-N-Fix, a region-aware fine-tuning method that trains models to correct only previously problematic image regions. The resulting fine-tuned model generates images with the same high-level structure as the original model but shows significant improvements in regions where the original model was deficient in safety (over-sexualization and violence), plausibility, or other criteria. Our experiments demonstrate that Focus-N-Fix improves these localized quality aspects with little or no degradation to others and typically imperceptible changes in the rest of the image. Disclaimer: This paper contains images that may be overly sexual, violent, offensive, or harmful.

Paper Structure

This paper contains 31 sections, 3 equations, 18 figures, 4 tables, 1 algorithm.

Figures (18)

  • Figure 1: Focus-N-Fix applied to reducing artifacts (top) and reducing over-sexualization (bottom). Each row shows the baseline from Stable Diffusion (SD) v1.4 [sd], the image after DRaFT fine-tuning, the image from our region-aware method, Focus-N-Fix, and a heatmap of problematic regions. Unconstrained fine-tuning, as in DRaFT, can yield an entirely different image for the same prompt, as in the STOP sign example (top row), or introduce artifacts (bottom row). Safety rewards are derived from a classifier [hao2023safety] predicting explicit content (multiplied by -1), while artifact rewards are based on a plausibility score from human feedback [maxo]. Images are from the test set; the heatmaps shown were unseen during training and not used for inference in Focus-N-Fix. Some images use a black box to cover sexually explicit regions. More examples of the proposed method and the baselines are in the Supplementary Material.
  • Figure 2: Focus-N-Fix for region-aware fine-tuning. Given a prompt $\mathbf{c}$ and initial noise sample $\mathbf{x}_T \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I})$, we sample image $\hat{I}_0$ from the pretrained model with parameters $\boldsymbol{\theta}_0$ and image $\hat{I}$ from the fine-tuned model with parameters $\boldsymbol{\theta}$. Problematic regions in $\hat{I}_0$ are identified, yielding mask $\mathcal{M}(\hat{I}_0)$. During fine-tuning, we maximize reward $r(\hat{I}, \mathbf{c})$ by modifying masked regions while keeping other areas mostly unchanged, using the regional constraint term ${\color{orange} \|(1 - \mathcal{M}(\hat{I}_0)) \odot (\hat{I} - \hat{I}_0) \|_F}$ to penalize changes outside the mask. Inference requires only one forward pass with the fine-tuned model. Focus-N-Fix builds on DRaFT [draft], updating only LoRA parameters during fine-tuning.
  • Figure 3: Safety (Over-Sexualization) Qualitative Comparisons. Left to Right: Stable Diffusion v1.4 (SD v1.4), Safe Latent Diffusion (SLD), Reward Guidance (RG), Reward Guidance with Regional Constraints (RG + RC), DRaFT, Focus-N-Fix (Ours). A black box was used in some images to cover sexually explicit regions and limit harm to readers.
  • Figure 4: Artifact Qualitative Comparisons. Left to Right: Stable Diffusion v1.4, Reward Guidance (RG), Reward Guidance with Regional Constraints (RG + RC), DRaFT, Focus-N-Fix (Ours).
  • Figure 5: Mean difference in VNLI score between the safety (over-sexualization) fine-tuned models and the baseline (SD v1.4) for each "challenge" category of PartiPrompts. T-tests were performed within each "challenge" category; significance is denoted by * ($p < 0.05$).
  • ...and 13 more figures
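
The regional constraint term in the Figure 2 caption, $\|(1 - \mathcal{M}(\hat{I}_0)) \odot (\hat{I} - \hat{I}_0)\|_F$, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses tiny grayscale images stored as nested lists, and the function names `regional_penalty` and `objective`, the `weight` parameter, and the way reward and penalty are combined are all assumptions made for illustration only.

```python
import math

def regional_penalty(img_ft, img_base, mask):
    """Frobenius norm of (1 - mask) * (img_ft - img_base) on a 2D grid.

    mask is 1.0 where the base image was flagged as problematic and
    0.0 elsewhere, so only changes OUTSIDE the mask are penalized.
    """
    total = 0.0
    for row_ft, row_base, row_mask in zip(img_ft, img_base, mask):
        for p_ft, p_base, m in zip(row_ft, row_base, row_mask):
            total += ((1.0 - m) * (p_ft - p_base)) ** 2
    return math.sqrt(total)

def objective(reward, img_ft, img_base, mask, weight=1.0):
    """Hypothetical combined objective: reward minus weighted penalty.

    `weight` is an assumed trade-off coefficient, not a value from the paper.
    """
    return reward - weight * regional_penalty(img_ft, img_base, mask)

# Toy 2x2 example: only the top-left pixel is flagged as problematic.
base = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.0, 0.0]]

inside = [[0.9, 0.0], [0.0, 0.0]]   # edit confined to the masked pixel
outside = [[0.0, 0.0], [3.0, 4.0]]  # edit entirely outside the mask

print(regional_penalty(inside, base, mask))   # no penalty inside the mask
print(regional_penalty(outside, base, mask))  # penalized: sqrt(3^2 + 4^2)
```

In this toy setting, an edit confined to the flagged pixel incurs zero penalty, while an equally sized edit elsewhere lowers the objective. This matches the stated intent of modifying masked regions while keeping other areas mostly unchanged.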