BioCLIP: A Vision Foundation Model for the Tree of Life
Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
TL;DR
BioCLIP addresses the need for a vision foundation model tailored to the tree of life by combining CLIP-style multimodal learning with a large, taxonomy-rich biology image dataset, TreeOfLife-10M. By flattening the taxonomy into taxonomic name strings and training with mixed text types, BioCLIP learns hierarchical, fine-grained representations that generalize to unseen taxa in zero-shot and few-shot settings. The work demonstrates consistent gains of 16–17 percentage points over baselines across ten diverse tasks, and provides intrinsic evidence that the model captures taxonomic hierarchy in its embeddings. The released dataset, the Rare Species test set, and the open-source code establish a foundation for scalable, taxonomy-aware biological vision research with broad applications in biodiversity monitoring and conservation.
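To make "flattening the taxonomy into taxonomic name strings" and "mixed text types" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the released BioCLIP code: the function names, the set of caption variants, the "a photo of" template, and the uniform sampling are all hypothetical choices made for this example.

```python
import random

# Linnaean ranks ordered from root to leaf (assumed rank set for this sketch).
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def taxonomic_name(taxon):
    """Flatten a rank -> name mapping into one root-to-leaf string.

    Ranks that are missing or empty are skipped.
    """
    return " ".join(taxon[rank] for rank in RANKS if taxon.get(rank))


def caption_variants(taxon, common_name=None):
    """Build several text descriptions of the same taxon (hypothetical variants)."""
    variants = {
        "taxonomic": taxonomic_name(taxon),
        "scientific": f"{taxon['genus']} {taxon['species']}",
    }
    if common_name:
        variants["common"] = common_name
        variants["taxonomic_common"] = (
            f"{variants['taxonomic']} with common name {common_name}"
        )
    return variants


def sample_caption(taxon, common_name=None, rng=random):
    """Pick one text type uniformly at random, as a stand-in for mixed-text training."""
    variants = caption_variants(taxon, common_name)
    key = rng.choice(sorted(variants))
    return f"a photo of {variants[key]}."


magpie = {
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Aves",
    "order": "Passeriformes",
    "family": "Corvidae",
    "genus": "Pica",
    "species": "hudsonia",
}

print(taxonomic_name(magpie))
# Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia
print(sample_caption(magpie, common_name="black-billed magpie"))
```

Because the flattened string shares its leading tokens with every other taxon in the same kingdom, phylum, class, and so on, a text encoder sees related taxa as strings with long common prefixes, which is one plausible reading of how the hierarchy can be exposed to a CLIP-style objective.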
Abstract
Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model that can address general organismal biology questions from images is therefore a timely need. To address it, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks and find that BioCLIP consistently and substantially outperforms existing baselines (by 16% to 17% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. Models, data, and code are available at https://imageomics.github.io/bioclip.
