RetriBooru: Leakage-Free Retrieval of Conditions from Reference Images for Subject-Driven Generation

Haoran Tang, Jieren Deng, Zhihong Pan, Hao Tian, Pratik Chaudhari, Xin Zhou

TL;DR

RetriBooru, a multi-level, same-identity dataset that groups anime characters by both face and cloth identities, is proposed together with a novel class of metrics, Similarity Weighted Diversity (SWD), which measures the overlooked diversity and better evaluates the alignment between similarity and diversity.

Abstract

Diffusion-based methods have demonstrated remarkable capabilities in generating a diverse array of high-quality images, sparking interest in styled avatars, virtual try-on, and more. Previous methods use the same image as both the reference and the target. An overlooked consequence is that the target's spatial information, style, and other attributes leak from the reference, which harms the diversity of the generated images and lets the model take shortcuts. This practice persists because widely available datasets usually consist of single images not grouped by identity, and collecting large-scale same-identity data is expensive. Moreover, existing metrics evaluate text alignment and identity preservation separately, which fails to distinguish balanced outputs from those that over-fit to one aspect. In this paper, we propose a multi-level, same-identity dataset, RetriBooru, which groups anime characters by both face and cloth identities. RetriBooru makes it possible to use reference images that show the same character and outfit as the target while leaving gestures and actions free to vary. We benchmark previous methods on our dataset and demonstrate the effectiveness of training with a reference image that differs from the target but shares its identity. We introduce a new concept composition task, in which the conditioning encoder learns to retrieve different concepts from several reference images, and modify a baseline network, RetriNet, for the new task. Finally, we introduce a novel class of metrics named Similarity Weighted Diversity (SWD) to measure the overlooked diversity and better evaluate the alignment between similarity and diversity.
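
The idea of training with a reference image that differs from the target but shares its identity can be illustrated with a minimal sketch. It is not the authors' data pipeline: the `(sample_id, character, outfit)` tuple layout and the function names `group_by_identity` and `sample_pair` are hypothetical, chosen only to show how same-identity groups allow a reference to be drawn that is never the target itself.

```python
import random
from collections import defaultdict


def group_by_identity(samples):
    """Group sample ids by a (character, outfit) key.

    `samples` is an iterable of (sample_id, character, outfit) tuples.
    This layout is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for sample_id, character, outfit in samples:
        groups[(character, outfit)].append(sample_id)
    return groups


def sample_pair(groups, rng):
    """Return (reference, target): two distinct images of one identity."""
    # Only groups with at least two images can yield a reference
    # that is different from the target.
    eligible = [ids for ids in groups.values() if len(ids) >= 2]
    if not eligible:
        raise ValueError("no identity group has two or more images")
    ids = rng.choice(eligible)
    # random.Random.sample draws without replacement, so the two ids differ.
    reference, target = rng.sample(ids, 2)
    return reference, target
```

Drawing the reference from the same group but excluding the target means the conditioning image can carry identity information without also carrying the target's exact pose or layout.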

Paper Structure

This paper contains 21 sections, 3 equations, 11 figures, 4 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Training the proposed concept composition task on our RetriBooru dataset. The concepts to retrieve are specified in text and passed to the retrieval encoder, which learns only from characteristic information to compose the output, guiding generation with text prompts in the U-Net.
  • Figure 2: Details of the RetriBooru dataset. Left: annotations of an individual sample. Right: distributions of the lengths of the "similar" lists, the top-15 characters with the most appearances, and the top-30 cloth tags.
  • Figure 2: Comparison of models by CLIP-I and CLIP-T scores. Left: average scores across validation prompts. Right: a scatter plot showing the balance between text and image prompt alignment. Overall, $\texttt{IP-Adapter-0.5-b}$ achieves the best balance.
  • Figure 3: Qualitative results of IP-Adapter trained on RetriBooru. We choose two image-prompt pairs and compare them at different scales. Each row has the same scale and each column has the same seed. Our -b pipeline gives better-balanced results at a fixed scale and keeps fusing image and text conditioning across scales, producing good generations even when the -a scale is off.
  • Figure 3: Baseline results on RetriBooru. Left: average scores across validation prompts, with the highest score for each task marked. Right: a scatter plot showing the balance between diversity (text) and similarity (image). Longer training achieves a better CLIP-I/CLIP-T balance.
  • ...and 6 more figures
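
The CLIP-I and CLIP-T scores named in the captions above are conventionally cosine similarities between CLIP embeddings: image-to-image for CLIP-I and image-to-text for CLIP-T. The sketch below assumes the embeddings have already been computed by some CLIP model and are passed in as plain lists of floats; the function names are hypothetical, and it does not reproduce the paper's evaluation code or its SWD metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_i(generated_image_emb, reference_image_emb):
    """Image-image similarity (identity preservation), assuming CLIP image embeddings."""
    return cosine(generated_image_emb, reference_image_emb)


def clip_t(generated_image_emb, prompt_text_emb):
    """Image-text similarity (prompt alignment), assuming CLIP image and text embeddings."""
    return cosine(generated_image_emb, prompt_text_emb)


def mean_score(scores):
    """Average a list of per-prompt scores, as in an 'average across validation prompts'."""
    return sum(scores) / len(scores)
```

Because the two scores are computed independently, a model can score high on one and low on the other, which is why the captions plot them against each other to judge balance.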