Not All Birds Look The Same: Identity-Preserving Generation For Birds

Aaron Sun, Oindrila Saha, Subhransu Maji

TL;DR

The study addresses the difficulty of identity-preserving generation in fine-grained, non-rigid domains by introducing NABLA, a bird-focused benchmark built from expert-curated look-alike image pairs, complemented by iNaturalist test pairs. It adapts zero-shot methods (OminiControl and Insert Anything) with proxy-identity training based on species, age, and sex to improve fidelity and generalization to unseen species. Quantitative results show a strong correlation between NABLA performance and true identity, including a 41% reduction in MSE on NABLA over baselines, with inpainting typically outperforming depth-based controls. The work also analyzes intra- and inter-species results, reveals limitations of current control modes, and suggests future directions for more robust, flexible identity-preserving generation in scientific visualization and education.

Abstract

Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data -- especially videos or multi-view observations of the same subject -- making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex -- used as a proxy for identity -- substantially improves performance on both seen and unseen species.
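
The proxy-identity grouping described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the record layout `(image_id, species, age, sex)` and the function names `build_proxy_pairs` and `sample_pair` are hypothetical, and the paper may form or sample pairs differently.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_proxy_pairs(records):
    """Return all image pairs that share the same (species, age, sex) group.

    `records` is an iterable of (image_id, species, age, sex) tuples.
    This layout is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for image_id, species, age, sex in records:
        groups[(species, age, sex)].append(image_id)

    pairs = []
    for members in groups.values():
        # Groups with a single image contribute no pairs.
        pairs.extend(combinations(sorted(members), 2))
    return pairs


def sample_pair(pairs, rng):
    """Pick one pair and randomly assign the subject and target roles."""
    first, second = rng.choice(pairs)
    return (first, second) if rng.random() < 0.5 else (second, first)
```

Two images are paired only when all three attributes match, so an adult and a juvenile of the same species are never treated as sharing an identity in this sketch.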

Paper Structure

This paper contains 28 sections, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Qualitative results for bird generation using our method. Generations match the pose of the target or control image but the appearance of the subject image. Left: Each row represents a different species subject being adapted into the same pose via a different control mode. Right: Reposing two similar species into the same canonical pose provides easier comparison between the two. Our generated images exhibit greater consistency and identity preservation than existing state-of-the-art approaches (Please zoom in for details; Species names are given in the bottom left of each image).
  • Figure 2: Failure cases for baselines compared to our method. Generated results should match the subject identity and target pose. However, the baseline models frequently change subject characteristics (rows 2, 3, and 4) and target pose (rows 1, 2, and 5). In contrast, our model generations correctly match the target image, as subject and target share an apparent identity in NABLA. The proprietary models were provided the subject image, masked background, and the prompt "Please inpaint this bird into the pose given by the black mask." Our model is the fine-tuned baseline in the first three rows; for the proprietary models, we show results from our OminiControl with FLUX-Kontext model.
  • Figure 3: Image pair examples from 4 datasets. Though iNaturalist and SSW60 have true identity-preservation, other qualities such as image quality and motion blur make them poor for image generation. NABirds consists of single-subject, high-quality images, but has inconsistent identity, even within classes. In contrast, NABLA has expert-verified lookalike bird pairs on high-quality NABirds images.
  • Figure 4: Dataset usage and pipeline. Left: In training, only images within the same class (shown in pink) are considered as pairs for sampling. Classes can vary in their hierarchy (species-level, gender-level, etc.) depending on the species. Right: During evaluation, the subject image and the control of the target image are passed to the model for generation. The generated and target images are masked, and the birds are evaluated using DINO, SigLIP, LPIPS, and MSE.
  • Figure 5: Varying control mode settings for training and evaluation. The top row shows model inputs and the bottom row shows fine-tuned model outputs and scores.
  • ...and 16 more figures
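
The masked evaluation described in the Figure 4 caption can be illustrated for the MSE metric. The sketch below is not the authors' implementation: it assumes a single boolean foreground mask shared by both images, arrays of shape height x width x channels, and a hypothetical function name `masked_mse`. The caption does not say how the masks are obtained or combined, so that part is an assumption.

```python
import numpy as np


def masked_mse(generated, target, mask):
    """Mean squared error over the pixels selected by `mask`.

    `generated` and `target` are same-shaped arrays (H x W x C or H x W).
    `mask` is an H x W boolean array marking the bird region (assumed
    here to be one mask applied to both images).
    """
    generated = np.asarray(generated, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)

    if generated.shape != target.shape:
        raise ValueError("generated and target must have the same shape")
    if mask.shape != generated.shape[:2]:
        raise ValueError("mask must match the image height and width")
    if not mask.any():
        raise ValueError("mask selects no pixels")

    # Boolean indexing keeps only masked pixels (all channels of each).
    diff = (generated - target)[mask]
    return float(np.mean(diff ** 2))
```

Restricting the error to the masked region keeps background differences from affecting the score, which is the stated purpose of masking before evaluation.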