BioCLIP: A Vision Foundation Model for the Tree of Life

Samuel Stevens; Jiaman Wu; Matthew J Thompson; Elizabeth G Campolongo; Chan Hee Song; David Edward Carlyn; Li Dong; Wasila M Dahdul; Charles Stewart; Tanya Berger-Wolf; Wei-Lun Chao; Yu Su

TL;DR

BioCLIP addresses the need for a vision foundation model tailored to the tree of life by combining CLIP-style multimodal learning with a large, taxonomy-rich biology image dataset, TreeOfLife-10M. By flattening the taxonomy into taxonomic name strings and training with mixed text types, BioCLIP learns hierarchical, fine-grained representations that generalize to unseen taxa in zero-shot and few-shot settings. The work demonstrates consistent 16–17 percentage-point gains over baselines across ten diverse tasks, and provides intrinsic evidence that the model captures taxonomic hierarchy in its embeddings. The released dataset, the Rare Species test set, and the open-source code establish a foundation for scalable, taxonomy-aware biological vision research with broad applications in biodiversity monitoring and conservation.
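
As a rough illustration of what "flattening the taxonomy into taxonomic name strings" and "mixed text types" could look like, here is a minimal Python sketch. The helper names (`flatten_taxonomy`, `text_types`), the seven-rank order, the text-type keys, and the common name are assumptions made for this example and are not taken from the BioCLIP codebase; the upper ranks for Onoclea sensibilis are filled in for illustration.

```python
import random

# Assumed rank order, from most general to most specific.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def flatten_taxonomy(taxon: dict) -> str:
    """Join rank values from kingdom down to species, skipping missing ranks."""
    return " ".join(taxon[rank] for rank in RANKS if taxon.get(rank))


def text_types(taxon: dict, common_name: str = "") -> dict:
    """Build several candidate captions for one image (hypothetical type names)."""
    captions = {
        "taxonomic": flatten_taxonomy(taxon),
        "scientific": f"{taxon['genus']} {taxon['species']}",
    }
    if common_name:
        captions["common"] = common_name
    return captions


example = {
    "kingdom": "Plantae",
    "phylum": "Tracheophyta",
    "class": "Polypodiopsida",
    "order": "Polypodiales",
    "family": "Onocleaceae",
    "genus": "Onoclea",
    "species": "sensibilis",
}

captions = text_types(example, common_name="sensitive fern")

# Mixed text types: pick one caption type at random for each training sample.
rng = random.Random(0)
chosen_type = rng.choice(sorted(captions))
chosen_caption = captions[chosen_type]
```

Sampling one caption type per training example is only one plausible way to mix text types; the paper's "Text Types" section is the authority on what was actually used.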

Abstract

Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks and find that BioCLIP consistently and substantially outperforms existing baselines (by 16% to 17% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. https://imageomics.github.io/bioclip has models, data and code.

Paper Structure (25 sections, 6 figures, 11 tables)

Introduction
TreeOfLife-10M
Images
Metadata & Aggregation
Release & Statistics
Modeling
Why CLIP?
Text Types
Experiments
Training and Evaluation Details
Can BioCLIP Generalize to Unseen Taxa?
How Do Text Types Affect Generalization?
Is the CLIP Objective Necessary?
Can BioCLIP Classify More Than Species?
Does BioCLIP Learn the Hierarchy?
...and 10 more sections

Figures (6)

Figure 1: (a) Two taxa, or taxonomic labels, for two different plants, Onoclea sensibilis (d) and Onoclea hintonii (e). These taxa are identical except for the species. (b) The autoregressive text encoder naturally encodes the hierarchical structure of the taxonomy. See how the Order token(s) (Polypodiales) can incorporate information from the Kingdom, Phylum and Class tokens, but nothing later in the hierarchy. This helps align the visual representations to this same hierarchical structure (see the intrinsic evaluation). (c) These hierarchical representations of taxonomic labels are fed into the standard contrastive pre-training objective and are matched with image representations (d) and (e).
Figure 2: Treemap of the 108 phyla in TreeOfLife-10M. Different colors are different phyla; nested boxes represent classes, orders, and families. Box size, not number of inner boxes, represents relative number of samples.
Figure 3: t-SNE visualization of image features, colored by taxonomic labels. BioCLIP (B) is visualized in the first and third rows and OpenAI's CLIP (O) is visualized in the second and fourth rows. BioCLIP's features better preserve the hierarchical structure: while both BioCLIP and CLIP cleanly separate the phyla in the Animalia Kingdom (top left), only BioCLIP successfully separates the orders in the Insecta Class (top right) and the families in the Lepidoptera Order (bottom left).
Figure F2: Example predictions for BioCLIP and CLIP on Birds 525, Plankton, Insects, Insects2, PlantNet and Fungi tasks. Ground truth labels are green; incorrect predictions are red. Left: Correct BioCLIP predictions. Center, Right: Images that CLIP incorrectly labels, but BioCLIP correctly labels.
Figure F3: Example predictions for BioCLIP and CLIP on PlantVillage, Medicinal Leaf, PlantDoc and Rare Species. Ground truth labels are green; incorrect predictions are red. Left: Correct BioCLIP predictions. Center, Right: Images that CLIP incorrectly labels, but BioCLIP correctly labels.
...and 1 more figure
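
The caption of Figure 1(b) says that, under an autoregressive text encoder, the Order token(s) can draw on the Kingdom, Phylum and Class tokens but on nothing later in the hierarchy. The small Python sketch below illustrates that visibility pattern with a causal (lower-triangular) mask. It assumes one token per rank for simplicity, whereas a real tokenizer would usually split each name into several subword tokens; the helper names and the upper ranks in the example are illustrative assumptions, not code from the paper.

```python
def causal_mask(n: int) -> list:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_context(tokens: list, position: int) -> list:
    """Return the tokens that the token at `position` is allowed to attend to."""
    mask = causal_mask(len(tokens))
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


# Simplifying assumption: each taxonomic rank is a single token.
tokens = [
    "Plantae",         # kingdom
    "Tracheophyta",    # phylum
    "Polypodiopsida",  # class
    "Polypodiales",    # order
    "Onocleaceae",     # family
    "Onoclea",         # genus
    "sensibilis",      # species
]

# The order token sees kingdom, phylum, class and itself, but no later rank.
order_context = visible_context(tokens, tokens.index("Polypodiales"))
```

Because every rank only sees the ranks above it, two labels that share a prefix (such as the two Onoclea species in Figure 1) produce identical intermediate states up to the point where they diverge.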
