Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias

Mingxiao Li, Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens

TL;DR

This work tackles overfitting and evaluation bias in personalized image generation by introducing a Background Attractor to separate subject and background, and by curating PDST, a dedicated test set for unbiased automatic evaluation. The approach couples a latent diffusion framework with Textual Inversion and NeTI, augmented by a contrastive loss and a background-specific attractor, to improve subject fidelity while preserving versatility across prompts. Key contributions include the attractor-based learning pipeline, the PDST benchmark for robust evaluation, and comprehensive ablations showing the importance of loss weighting and background disentanglement. Practically, the method yields more reliable automatic metrics and higher-quality, text-aligned personalizations, facilitating safer and more effective real-world deployment.

Abstract

Personalized image generation via text prompts has great potential to improve daily life and professional work by facilitating the creation of customized visual content. The aim of image personalization is to create images based on a user-provided subject while maintaining both consistency of the subject and flexibility to accommodate various textual descriptions of that subject. However, current methods face challenges in ensuring fidelity to the text prompt while not overfitting to the training data. In this work, we introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images, allowing the model to focus on learning an effective representation of the personalized subject. Moreover, current evaluation methods struggle due to the lack of a dedicated test set. The evaluation set-up typically relies on the training data of the personalization task to compute text-image and image-image similarity scores, which, while useful, tend to overestimate performance. Although human evaluations are commonly used as an alternative, they often suffer from bias and inconsistency. To address these issues, we curate a diverse and high-quality test set with well-designed prompts. With this new benchmark, automatic evaluation metrics can reliably assess model performance.
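The evaluation concern raised in the abstract can be illustrated with a minimal sketch. It assumes image and text embeddings have already been extracted (for example by DINO or CLIP encoders) and are represented as plain lists of floats; the function names and toy vectors below are hypothetical and are not taken from the paper. The sketch only shows why scoring generated images against the training images can look better than scoring them against held-out reference images of the same subject.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_image_score(generated, references):
    """Mean pairwise cosine similarity between generated and reference image embeddings."""
    sims = [cosine(g, r) for g in generated for r in references]
    return sum(sims) / len(sims)


def text_image_score(generated, prompt_embedding):
    """Mean cosine similarity between generated image embeddings and one prompt embedding."""
    sims = [cosine(g, prompt_embedding) for g in generated]
    return sum(sims) / len(sims)


# Toy 2-D embeddings (hypothetical values, for illustration only).
train_refs = [[1.0, 0.0], [0.9, 0.1]]         # images the model was personalized on
test_refs = [[0.6, 0.8], [0.0, 1.0]]          # held-out images of the same subject
overfit_outputs = [[1.0, 0.0], [0.95, 0.05]]  # generations that copy the training images

train_score = image_image_score(overfit_outputs, train_refs)
test_score = image_image_score(overfit_outputs, test_refs)
print(f"scored on training images: {train_score:.3f}")
print(f"scored on held-out images: {test_score:.3f}")
```

In this toy setting the overfit generations score close to 1 against the training images but much lower against the held-out ones, which is the kind of overestimation a dedicated test set is meant to expose.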

Paper Structure

This paper contains 19 sections, 6 equations, 16 figures.

Figures (16)

  • Figure 1: Changes in DINO image-image similarity and CLIP text-image similarity scores for the DreamBooth model across training steps. The DINO score, evaluated on a separate test set, shows a stronger correlation with the CLIP text similarity score. In contrast, the DINO score on the training set remains nearly unchanged, even after the model has fully overfit to the training data.
  • Figure 2: Dataset Construction Process. The process consists of image collection and caption generation. We begin by manually collecting images from Unsplash and Pexels, supplemented by our own photography. To ensure high-quality data, all images undergo manual inspection and filtering by human evaluators. Prompt generation is a two-step process: first, Qwen2-VL generates initial captions for each image. These captions are then refined through human editing and GPT-4o to enhance clarity and correctness. Additionally, any text that might reveal subject-specific information is carefully removed to maintain neutrality.
  • Figure 3: Left: A word cloud visualization of the captions in the test set. Right: The top 30 most frequent words in the test set captions, with stopwords removed.
  • Figure 4: Overview of our proposed training pipeline. (a) illustrates our disentangled training losses, which include mask background loss, mask subject loss, joint loss, and contrastive loss. (b) demonstrates how we obtain learnable representations for the target subject and background attractor when applying our pipeline to Textual Inversion and NeTI.
  • Figure 5: Quantitative evaluation: Comparing CLIP/DINO image similarity versus text similarity, with bubble sizes indicating text similarity scores, where a larger size corresponds to a better score.
  • ...and 11 more figures
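
The Figure 4 caption names mask subject, mask background, and joint losses without giving their form. The following is a rough sketch of one plausible reading, not the paper's definition: each loss is assumed to be a mean squared error restricted by a binary subject mask, computed on three hypothetical predictions (conditioned on the subject representation, on the background attractor, and on both). The function names, the flat-list layout, and the weights are all assumptions made for illustration. The contrastive loss is omitted because the caption does not describe it.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over the positions where mask is 1 (0.0 if none are selected)."""
    selected = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def disentangled_losses(pred_subject, pred_background, pred_joint, target, subject_mask):
    """Compute the three masked reconstruction terms under the assumptions stated above."""
    background_mask = [1 - m for m in subject_mask]
    full_mask = [1] * len(target)
    return {
        "mask_subject": masked_mse(pred_subject, target, subject_mask),
        "mask_background": masked_mse(pred_background, target, background_mask),
        "joint": masked_mse(pred_joint, target, full_mask),
    }


def weighted_total(losses, weights):
    """Weighted sum of the loss terms; the weights are hypothetical hyperparameters."""
    return sum(weights[name] * value for name, value in losses.items())


# Toy flattened "image" with four pixels; the first two belong to the subject.
target = [1.0, 2.0, 3.0, 4.0]
subject_mask = [1, 1, 0, 0]

losses = disentangled_losses(
    pred_subject=[1.0, 2.0, 0.0, 0.0],     # correct on the subject region only
    pred_background=[0.0, 0.0, 3.0, 4.0],  # correct on the background region only
    pred_joint=[1.0, 2.0, 3.0, 5.0],       # one background pixel off by 1
    target=target,
    subject_mask=subject_mask,
)
total = weighted_total(losses, {"mask_subject": 1.0, "mask_background": 1.0, "joint": 1.0})
print(losses, total)
```

Restricting each term to its own region means errors outside the subject mask do not penalize the subject prediction, which is the intuition behind letting a separate background representation absorb distractions in the training images.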