Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation

Connor Kilrain, David Carlyn, Julia Chae, Sara Beery, Wei-Lun Chao, Jianyang Gu

TL;DR

The paper addresses how to evaluate identity preservation in subject-driven personalized generation, arguing that existing similarity-based metrics miss fine-grained identity details. It proposes Finer-Personalization Rank, a gallery-based retrieval protocol where a generated image is used to rank real images from a fine-grained gallery, with mean average precision capturing identity retention across category and instance levels. Experiments on CUB, Stanford Cars, and Animal Re-ID show substantial identity drift in popular personalization methods when evaluated with the proposed protocol, and demonstrate that specialized encoders improve detection of identity-specific details. The protocol is presented as a complementary, cost-efficient tool for developing and validating personalized generation systems with real-world user identity requirements.

Abstract

The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one's pet), we expect the generated image to retain precise details attached to the subject's identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities -- from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.
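The ranking view described above can be illustrated with a minimal sketch. It assumes features have already been extracted by some image encoder and uses cosine similarity for ranking; the function names `average_precision` and `gallery_map`, the similarity choice, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def average_precision(ranked_relevance):
    """Average precision for one query, given 0/1 relevance in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0  # no gallery image shares the query identity
    hits = np.cumsum(rel)                      # number of matches within top-k
    ranks = np.arange(1, len(rel) + 1)
    precision_at_k = hits / ranks
    # Average the precision values at the ranks where a match occurs.
    return float((precision_at_k * rel).sum() / rel.sum())


def gallery_map(query_feats, query_ids, gallery_feats, gallery_ids):
    """Mean average precision of generated-image queries against a real gallery.

    query_feats:   (Q, D) features of generated images (assumed precomputed)
    query_ids:     (Q,)   identity label of the reference used for each query
    gallery_feats: (G, D) features of real gallery images
    gallery_ids:   (G,)   identity labels of the gallery images
    """
    # L2-normalize so the dot product equals cosine similarity (an assumption).
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T
    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i], kind="stable")   # most similar first
        rel = (gallery_ids[order] == query_ids[i]).astype(int)
        aps.append(average_precision(rel))
    return float(np.mean(aps))


# Toy example: 2-D features, two identities in the gallery.
gallery_feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
gallery_ids = np.array([0, 1, 0])
query_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
query_ids = np.array([0, 0])  # both queries were generated from identity 0

score = gallery_map(query_feats, query_ids, gallery_feats, gallery_ids)
```

In this toy case the first query ranks both identity-0 images at the top, while the second query has drifted toward identity 1 and ranks the wrong image first, which lowers the mean score. A pairwise similarity to the reference alone would not expose this kind of confusion with neighboring identities.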

Paper Structure

This paper contains 30 sections, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Left: Given a specific dog as the personalized concept to preserve, two generated images fail to capture the distinctive spot and color patterns on the head. However, popularly adopted evaluation metrics, including CLIP similarity and GPT-based DreamBench++ (DB++) [peng2025dreambench++], still give high concept-preserving scores. Right: We propose to evaluate personalization models for their identity preservation capabilities by probing a gallery of visually similar real images with the generated image.
  • Figure 2: Left: Current evaluation on concept preservation uses general-purpose encoders or GPT to produce a pair-wise similarity score with the reference image. Right: We propose a gallery-based protocol to evaluate identity preservation. The generated image is used to probe a gallery of similar identities. We use mean average precision (mAP) as the measurement. The protocol demands more focus on the retention of fine-grained discriminative details and is more useful in subject-driven personalization scenarios.
  • Figure 3: Pair-wise similarity scores (the first row) and gallery-based mAP scores (the second row) of seven personalization models across three benchmarks. The proposed gallery-based protocol more clearly reveals identity drift of these models compared with the overall high similarity scores. Meanwhile, specialized models provide more faithful views to examine identity-related discriminative details.
  • Figure 4: Qualitative comparisons between pair-wise similarity scores and the gallery-based Finer-Personalization Rank. Each group shows the reference image used in prompting generation, the generated image, and the retrieval list using the generated image as the query. Our evaluation protocol reveals the variations in subtle details and demonstrates robustness against context variations.
  • Figure 5: Ablation study on the gallery construction. We vary the number of images per subject (left) and the number of subjects (right) in the gallery. "Sim" refers to Similarity. Results for generated images are averaged over all evaluated personalization models. Adding more subjects does not change pair-wise similarities, as the extra subjects are not involved in the calculation.
  • ...and 4 more figures