
TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, Anuj Karpatne, Cheng Zhang

Abstract

Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.

Paper Structure

This paper contains 32 sections, 5 equations, 13 figures, and 14 tables.

Figures (13)

  • Figure 1: Existing text-to-image models struggle with fine-grained species details. We propose TaxaAdapter to inject VTM-derived embeddings into a frozen diffusion model to make fine-grained synthesis taxonomy-aware, enabling: (a) fine-grained species synthesis with accurate morphological details, (b) diverse and high-fidelity trait synthesis, (c) free-form text control, and (d) strong out-of-distribution generalization.
  • Figure 2: TaxaAdapter pipeline overview. Given a taxonomic name (e.g., Kingdom → Species), we extract taxonomy-image aligned embeddings using a pre-trained vision taxonomy model (e.g., BioCLIP, BioTrove-CLIP, or TaxaBind) and obtain complementary text features from the frozen CLIP text encoder. The two conditioning streams are fused through a decoupled cross-attention mechanism, where the taxonomy branch captures species-level traits and the text branch retains free-form control over contextual cues such as style, background, or pose. During training, we update only the projection and cross-attention layers, while the diffusion backbone remains frozen for efficient and stable adaptation.
  • Figure 3: Caption-based trait fidelity evaluation. We leverage an MLLM to generate trait captions for real and generated images, summarize each set into a species-level trait description with an LLM, and compute text similarity between the two summaries. Our metric provides an interpretable, trait-level measure of morphological fidelity that complements standard image metrics.
  • Figure 4: Qualitative comparison on TreeOfLife-1M. Each row shows a different species spanning birds, mammals, insects, and reptiles. TaxaAdapter generates morphology-faithful images that align with taxonomy-defining traits (e.g., texture patterns, body shape, coloration) while maintaining realistic textures and backgrounds. Notably, the first two rows illustrate two extremely similar species under the same Genus, where subtle differences in white spotting patterns on bird wings (highlighted in yellow circles) are correctly captured. Please see the Appendix for additional qualitative results.
  • Figure 5: Quantitative OOD evaluation on 51 CUB-200-2011 species unseen during training. Models are trained on iNat-mini and tested on unseen CUB classes.
  • ...and 8 more figures
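The decoupled cross-attention described for the pipeline (Figure 2) can be sketched in a few lines: the diffusion latents attend to the CLIP text tokens and to the taxonomy (VTM) tokens through two independent attention calls whose outputs are summed, so each conditioning stream keeps its own key/value space. This is a minimal single-head NumPy illustration of the general mechanism, not the authors' implementation; real adapters use learned key/value projections per branch and a learned projection of the VTM embedding, which are omitted here (tokens serve directly as keys and values), and the `scale` knob is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    """Single-head attention where the kv tokens act as both keys and values."""
    d = q.shape[-1]
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

def decoupled_cross_attention(latent_q, text_tokens, taxa_tokens, scale=1.0):
    # Text branch and taxonomy branch attend independently; their outputs
    # are summed, so adding the taxonomy stream does not disturb the
    # frozen text-conditioning pathway.
    return cross_attention(latent_q, text_tokens) + \
           scale * cross_attention(latent_q, taxa_tokens)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 64))    # latent queries (16 spatial tokens)
txt = rng.standard_normal((77, 64))  # CLIP text tokens
tax = rng.standard_normal((1, 64))   # single VTM embedding token
out = decoupled_cross_attention(q, txt, tax)
print(out.shape)  # (16, 64)
```

Because the two streams are only combined additively after attention, setting `scale=0` recovers plain text-conditioned attention, which is consistent with the paper's claim that free-form text control is preserved.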
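The caption-based trait-fidelity metric (Figure 3) reduces, at its last step, to comparing two species-level trait summaries. As a toy sketch of that final comparison only: the MLLM captioning and LLM summarization stages are assumed to have already produced the two summary strings, and a simple bag-of-words cosine similarity stands in for whatever text-similarity measure the paper actually uses.

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words token counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def trait_fidelity(real_summary: str, gen_summary: str) -> float:
    # Upstream (not shown): an MLLM captions each real/generated image,
    # and an LLM condenses each caption set into one trait summary.
    return cosine_sim(real_summary, gen_summary)

real = "black wings with white spotting and a short conical bill"
gen = "black wings with white spotting and a long thin bill"
print(round(trait_fidelity(real, gen), 3))
```

Unlike image-level scores such as FID, a score built this way can be inspected directly: the mismatched tokens (here, the bill description) point to the specific traits the generator got wrong.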