BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity
Chih-Hsuan Yang, Benjamin Feuer, Zaki Jubery, Zi K. Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh K Singh, Arti Singh, Soumik Sarkar, Nirav Merchant, Chinmay Hegde, Baskar Ganapathysubramanian
TL;DR
BioTrove tackles the biodiversity AI data gap by delivering a large-scale, richly annotated image corpus derived from iNaturalist Open Data. It introduces BioTrove-Train (≈40M images across 33K species) and BioTrove-CLIP, a family of vision-language foundation models, along with three new benchmarks to probe zero-shot and few-shot generalization across life stages and rare species. Results show BioTrove-CLIP achieves strong average performance across benchmarks, especially on BioTrove-Balanced, while analyses highlight the value of taxonomic descriptions that pair scientific and common names and the continued role of specialist data for lower taxonomic levels. The accompanying BioTrove-Process tooling enables researchers to construct balanced, taxonomy-aware subsets, accelerating practical biodiversity monitoring and conservation AI.
Abstract
We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BioTrove will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BioTrove is publicly available, easily accessible, and ready for immediate use.
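The selection and captioning described in the abstract can be illustrated with a minimal sketch. It filters image records to the seven named categories and builds a text caption from the common and scientific names. The `Record` fields, the helper names, and the caption template are assumptions made for illustration; they are not the actual BioTrove schema, tooling, or caption format.

```python
from dataclasses import dataclass
from typing import Tuple

# The seven categories named in the abstract. They sit at different ranks
# (kingdom, phylum, class), so the filter scans the whole lineage.
TARGET_TAXA = {
    "Aves", "Arachnida", "Insecta", "Plantae", "Fungi", "Mollusca", "Reptilia",
}


@dataclass(frozen=True)
class Record:
    """Hypothetical per-image metadata; not the actual BioTrove schema."""
    image_id: str
    lineage: Tuple[str, ...]  # ranks from kingdom down to genus
    scientific_name: str
    common_name: str


def in_target_taxa(record: Record) -> bool:
    """Return True if any rank in the lineage is one of the seven categories."""
    return any(rank in TARGET_TAXA for rank in record.lineage)


def make_caption(record: Record) -> str:
    """Build a caption from both names (illustrative template only)."""
    return f"a photo of {record.common_name} ({record.scientific_name})"


monarch = Record(
    image_id="img-001",
    lineage=("Animalia", "Arthropoda", "Insecta", "Lepidoptera", "Nymphalidae", "Danaus"),
    scientific_name="Danaus plexippus",
    common_name="monarch butterfly",
)
fox = Record(
    image_id="img-002",
    lineage=("Animalia", "Chordata", "Mammalia", "Carnivora", "Canidae", "Vulpes"),
    scientific_name="Vulpes vulpes",
    common_name="red fox",
)

# Keep only records that fall under one of the seven categories.
subset = [r for r in (monarch, fox) if in_target_taxa(r)]
captions = [make_caption(r) for r in subset]
```

In this toy input the monarch butterfly is kept because its lineage includes Insecta, while the red fox (Mammalia) is filtered out. Checking the full lineage rather than a single rank is what lets one filter handle kingdom-level entries such as Plantae alongside class-level entries such as Aves.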
