
NearID: Identity Representation Learning via Near-identity Distractors

Aleksandar Cvejic, Rameen Abdal, Abdelrahman Eldesokey, Bernard Ghanem, Peter Wonka

Abstract

When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
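The two-tier contrastive objective described above enforces the similarity ordering same identity > NearID distractor > random negative. A minimal sketch of one way such a two-tier margin objective could look, using cosine similarity over embedding vectors; the loss form, margin values, and function name are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def two_tier_margin_loss(anchor, positive, near_distractor, random_neg,
                         margin_near=0.2, margin_rand=0.2):
    """Sketch of a two-tier margin objective enforcing
    sim(anchor, positive) > sim(anchor, near_distractor) > sim(anchor, random_neg).
    Margins and the hinge form are illustrative assumptions."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    s_pos = cos(anchor, positive)          # same identity, different view
    s_near = cos(anchor, near_distractor)  # matched-context NearID distractor
    s_rand = cos(anchor, random_neg)       # unrelated random negative

    # Tier 1: the true positive must beat the matched-context distractor.
    tier1 = max(0.0, margin_near - (s_pos - s_near))
    # Tier 2: the distractor must still beat an unrelated random negative.
    tier2 = max(0.0, margin_rand - (s_near - s_rand))
    return tier1 + tier2
```

When the desired hierarchy holds with sufficient margin, both hinge terms vanish and the loss is zero; any violation of either tier contributes a positive penalty.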

Paper Structure

This paper contains 47 sections, 8 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: NearID: From Context to Identity. (Left) Traditional representations entangle object identity with background context. (Middle) Synthetic data lacks explicit control over visually similar distractors. (Right) NearID introduces matched-context distractors to remove contextual shortcuts and isolate intrinsic identity signals.
  • Figure 2: NearID overview. Left: In the pretrained image-embedding manifold (e.g., SigLIP2 [tschannen2025siglip2]), identity-consistent positives can be misaligned: edited or re-rendered views of the same instance may lie farther from the anchor than visually confusable negatives (NearID distractors illustrated by $d_i < d_j$), which degrades retrieval and scoring reliability. Right: We keep the SigLIP2 image encoder frozen and train only a lightweight MAP [zhai2022scaling_MAP, jaegle2021perceiver_MAP2] projection head to reshape the similarity geometry for the target task. Compared to the frozen baseline, the trained head increases similarity for true positives (green) while pushing NearID distractors and unrelated gallery items lower (red), improving the desired ordering of similarity scores with respect to a gallery.
  • Figure 3: Left: Attention map comparison vs baseline on positives and negatives. Right: Summary of properties evaluated in this paper. NearID improves matched-context Near-ID rejection on NearID-bench and transfers to part-level identity evaluation on MTG, while remaining mask-free at inference.
  • Figure 4: Per-category human alignment ($M-H$) on DreamBench++ [peng2024dreambench++]. Each radar shows the Pearson correlation between a metric and human concept-preservation judgments across four DB++ categories; NearID (Ours) is repeated as a dashed reference in each subplot. Despite training exclusively on rigid objects, NearID improves over the frozen SigLIP2 baseline on Animal ($+0.105$) and Human ($+0.065$), indicating that disentangling identity from context transfers across semantic domains. The expected decrease on Style ($-0.092$), a category entirely absent from training, confirms that the gains reflect genuine identity learning rather than general score inflation.
  • Figure 5: KernelPCA visualization of Near-Identity separation. We project embeddings for $n{=}7$ identities (S1--S7) into 2D using identical KernelPCA [scholkopf1998nonlinear_kernelpca] settings in both panels. Circles denote positives (same identity) and crosses denote matched-context NearID distractors (negatives). Compared to the frozen SigLIP2 baseline, NearID increases separation by pushing the distractors away from the corresponding positive clusters, providing visual evidence of improved near-identity discrimination.
  • ...and 5 more figures
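The abstract's Sample Success Rate (SSR) is described as a strict margin-based identity discrimination metric that penalizes ranking distractors above true cross-view matches. A minimal sketch under one plausible reading of that description, where a sample succeeds only if the true match outscores every distractor by a margin; the function name, margin handling, and aggregation are assumptions for illustration:

```python
import numpy as np

def sample_success_rate(sim_pos, sim_distractors, margin=0.0):
    """Sketch of a strict margin-based Sample Success Rate (SSR).
    A sample counts as a success only if the true cross-view match scores
    above EVERY matched-context distractor by at least `margin`.
    The exact criterion and margin are illustrative assumptions.

    sim_pos: shape (N,), similarity to the true match for each sample.
    sim_distractors: shape (N, K), similarities to K distractors per sample.
    """
    sim_pos = np.asarray(sim_pos, dtype=float)
    sim_distractors = np.asarray(sim_distractors, dtype=float)
    hardest = sim_distractors.max(axis=1)      # toughest distractor per sample
    success = sim_pos - hardest >= margin      # strict per-sample criterion
    return float(success.mean())
```

Under this reading, an encoder that frequently ranks a matched-context distractor above the true match, as reported for pre-trained encoders, would score low because a single hard distractor is enough to fail a sample.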