Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods

Panos Achlioptas, Alexandros Benetatos, Iordanis Fostiropoulos, Dimitris Skourtis

TL;DR

This work tackles human-centric personalized text-to-image generation by introducing Stellar, a large-scale dataset of imaginative prompts paired with 400 identities, and a comprehensive, interpretable metric suite that isolates identity fidelity from object grounding. It then presents StellarNet, a baseline built on dynamic textual inversion that uses SDXL with LoRA to personalize outputs without per-subject fine-tuning and that performs strongly in human-preference trials. The authors demonstrate that their identity- and object-centric metrics correlate more strongly with human judgments than existing measures, and show that StellarNet outperforms prior personalized generators across multiple evaluations. Together, the Stellar data, metrics, and baseline provide a standardized platform to advance and fairly compare personalized T2I methods, while highlighting ethical considerations for real-world use.

Abstract

In this work, we systematically study the problem of personalized text-to-image generation, where the output image is expected to portray information about specific human subjects, e.g., generating images of oneself appearing at imaginative places, interacting with various items, or engaging in fictional activities. To this end, we focus on text-to-image systems that take as input a single image of an individual to ground the generation process, along with text describing the desired visual context. Our first contribution is to fill a gap in the literature by curating high-quality, appropriate data for this task. Namely, we introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and comes with readily available, rich semantic ground-truth annotations. Having established Stellar, and to further promote fine-grained cross-system comparisons, we introduce a rigorous ensemble of specialized metrics that highlight and disentangle fundamental properties such systems should obey. Besides being intuitive, our new metrics correlate significantly more strongly with human judgment than currently used metrics on this task. Last but not least, drawing inspiration from the recent works of ELITE and SDXL, we derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and that sets a new SoTA both quantitatively and in human trials. For more information, please visit our project's website: https://stellar-gen-ai.github.io/.

Paper Structure

This paper contains 28 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Left: User preferences between StellarNet and existing personalized approaches. StellarNet's output is preferred by a very large margin (78% of all trials). Right: Kendall's $\tau$ correlation among existing metrics and our introduced metrics. Our metrics correlate significantly better with key aspects of personalized generations and with human preference.
  • Figure 2: StellarNet overview. The Dynamic Textual Inversion (DTI) module inverts the foreground-masked identity image (from CelebAMask-HQ) into textual embeddings, $S^*$ (left). $S^*$ augments the textual prompt passed to the pre-trained text-to-image model (SDXL) to guide the model toward generating images with the given identity (middle). Additionally, we fine-tune the UNet backbone of SDXL using LoRA weight offsets (Hu et al., 2022) for efficient and stable training (middle bottom). We apply a masked MSE loss during training over the input image and the output generation (right).
  • Figure 3: StellarNet generations grounded on different noise-controlling random seeds. StellarNet produces rich variations given a human subject (top row, from CelebAMask-HQ) and a fixed prompt (bottom of each column). Best viewed by zooming in on the digital version.
  • Figure 4: Qualitative comparison of StellarNet vs. SoTA personalized-T2I methods. The leftmost column depicts the input image from CelebAMask-HQ portraying the actor's identity (marked in text as $\mathbf{S^*}$). The four rightmost images are generations produced by the system named in the column's title. All methods take as input the corresponding prompt shown next to each row. Additionally, with every generation, we include five colored circles representing the preference of the five metrics listed at the bottom of the figure. An opaque circle indicates the image with the highest score for each metric among the generations of the same row.
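
Two quantities named in the captions above, Kendall's $\tau$ (Figure 1) and a masked MSE loss (Figure 2), can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `kendall_tau` and `masked_mse` are hypothetical, the $\tau$ variant shown is tau-a (tied pairs count toward neither side), and the loss is normalized by the number of unmasked pixels. The captions do not specify either choice, so both are assumptions.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: tied pairs are counted as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both rankings order items i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def masked_mse(pred: Sequence[float], target: Sequence[float],
               mask: Sequence[int]) -> float:
    """Mean squared error over flattened pixels where mask == 1.

    Assumption: the loss is averaged over unmasked pixels only.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")
    kept = sum(mask)
    if kept == 0:
        raise ValueError("mask selects no pixels")
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / kept
```

For example, `kendall_tau(metric_scores, human_scores)` returns 1.0 when a metric ranks every pair of images the same way human raters do and -1.0 when it ranks every pair the opposite way. `masked_mse` ignores background pixels entirely, so only the foreground (identity) region contributes to the loss.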