BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su

TL;DR

BioCLIP 2 demonstrates that scaling hierarchical contrastive learning on a large, taxonomically structured biological image corpus yields emergent properties that enhance interpretability and generalization beyond species identification. The work combines TreeOfLife-200M with an experience-replay training paradigm to improve performance on diverse biological tasks and reveals two emergent behaviors: inter-species ecological alignment and preservation of intra-species variation in orthogonal subspaces. A formal analysis and extensive ablations explain why scale fosters these properties and how hierarchical supervision facilitates functional clustering without explicit trait labels. The results suggest that domain-specific data scaling, coupled with structured hierarchy, can produce biologically meaningful embeddings useful for conservation, trait analysis, and agricultural applications. These findings position BioCLIP 2 as a strong biology-focused foundation approach and point to scalable strategies for emergent scientific discovery.

Abstract

Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.

Paper Structure

This paper contains 35 sections, 1 theorem, 10 equations, 15 figures, 8 tables.

Key Result

Theorem 5.1

Let $\boldsymbol{\mu}$ be the prototypes of species, with $\boldsymbol{\mu}_s$ as the prototype of species $s$. Let $\tau$ be temperature. If different $\boldsymbol{\mu}_k$ are nearly orthogonal (i.e., species are well separated), the intra-species variation $\delta$ for species $s$ is constrained b

Figures (15)

  • Figure 1: While BioCLIP 2 is trained to distinguish species, it demonstrates emergent properties beyond the initial training objective. Left: At the inter-species level, the embedding distribution of different species aligns with ecological relationships; the embeddings of Darwin's finches arrange themselves by beak size from left to right. Right: Instead of collapsing, the intra-species variations are preserved in subspaces orthogonal to the inter-species variation (the black lines point from the mean embedding of one variant to that of the other variant). Orthogonality increases with scale (see \ref{fig:scale-fdr}).
  • Figure 2: (a) Number of images across organismal biology datasets. (b) Biodiversity comparison across datasets (measured as unique taxonomic 7-tuples for the TreeOfLife (ToL) datasets; for BioTrove, the species count it provides is used). (c) The difference in taxa distribution within the Cephalopoda class (octopuses, squids, etc.) between ToL-200M and ToL-10M.
  • Figure 3: (a) The model performance on five downstream tasks under different scales of training data. (b) The model performance on differentiating and aligning different life stages and sexes. (c) The separation and orthogonality evaluation of models trained with different amounts of data.
  • Figure 4: t-SNE embedding visualization of FishNet test set for models trained with different amounts of data. The leftmost plot is the original LAION-2B CLIP ViT-L/14. As the training data scales, freshwater fish become more distinct from saltwater fish and brackish fish, despite no explicit supervision, demonstrating that data scale contributes to emergent properties in model representations.
  • Figure 5: The embedding distribution of life stage variations under different scales of training data. The 2D distributions are obtained using t-SNE. For the 3D distributions, we first run SVD with the mean embedding of each species. The first two singular vectors are used to construct the gray plane that captures most inter-species differences. The embeddings are then projected into the 3D space with an additional orthogonal dimension. The straight lines point from the mean embedding of juvenile images to that of the adult images. As the training scales up, the intra-species variations are preserved in the subspace orthogonal to the inter-species differences. Orthogonality improves with data scale, as evidenced by the decreasing explained-variance ratio $\rho$.
  • ...and 10 more figures
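
The projection described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It is not the authors' code. The function names, the centering of species means before the SVD, the choice of the third axis (the part of the juvenile-to-adult direction lying outside the plane), and the definition of $\rho$ as the fraction of that direction's squared norm inside the inter-species plane are all assumptions made for illustration.

```python
import numpy as np

def interspecies_basis(species_means, k=2):
    """Top-k right singular vectors of the species-mean matrix.

    species_means: array of shape (num_species, dim).
    Centering before the SVD is an assumption; the caption does not say.
    """
    centered = species_means - species_means.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # shape (k, dim), rows are orthonormal

def explained_variance_ratio(direction, basis):
    """Assumed definition of rho: share of the squared norm of
    `direction` that lies inside the span of `basis`."""
    coords = basis @ direction
    return float(coords @ coords / (direction @ direction))

def project_3d(embeddings, basis, direction):
    """Coordinates on the inter-species plane plus one orthogonal axis.

    The third axis is the normalized component of `direction` that is
    orthogonal to the plane (a hypothetical choice for this sketch).
    """
    residual = direction - basis.T @ (basis @ direction)
    third = residual / np.linalg.norm(residual)
    axes = np.vstack([basis, third])  # shape (3, dim)
    return embeddings @ axes.T

# Synthetic example: species differ only along the first two dimensions.
rng = np.random.default_rng(0)
dim = 16
species_means = np.zeros((10, dim))
species_means[:, :2] = rng.normal(size=(10, 2))
basis = interspecies_basis(species_means)

# Hypothetical juvenile and adult mean embeddings for one species,
# differing mostly along a dimension outside the inter-species plane.
juvenile_mean = species_means[0]
adult_mean = species_means[0] + np.eye(dim)[2] + 0.1 * np.eye(dim)[0]
direction = adult_mean - juvenile_mean

rho = explained_variance_ratio(direction, basis)
points_3d = project_3d(np.stack([juvenile_mean, adult_mean]), basis, direction)
print(f"rho = {rho:.4f}")
print(points_3d.shape)
```

Under this assumed definition, a value of $\rho$ near 0 means the intra-species direction is almost orthogonal to the inter-species plane, and a value near 1 means it lies almost entirely inside it. `project_3d` is undefined when the direction lies exactly in the plane, since the residual is then zero.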

Theorems & Definitions (3)

  • Theorem 5.1
  • proof
  • proof