BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

Chih-Hsuan Yang; Benjamin Feuer; Zaki Jubery; Zi K. Deng; Andre Nakkab; Md Zahid Hasan; Shivani Chiranjeevi; Kelly Marshall; Nirmal Baishnab; Asheesh K Singh; Arti Singh; Soumik Sarkar; Nirav Merchant; Chinmay Hegde; Baskar Ganapathysubramanian

TL;DR

BioTrove tackles the biodiversity AI data gap by delivering a large-scale, richly annotated image corpus derived from iNaturalist Open Data. It introduces BioTrove-Train (≈40M images across 33K species) and BioTrove-CLIP, a family of vision-language foundation models, along with three new benchmarks to probe zero-shot and few-shot generalization across life stages and rare species. Results show BioTrove-CLIP achieves strong average performance across benchmarks, especially on BioTrove-Balanced, while analyses highlight the value of dual-language taxonomic descriptions and the continued role of specialist data for lower taxonomic levels. The accompanying BioTrove-Process tooling enables researchers to construct balanced, taxonomy-aware subsets, accelerating practical biodiversity monitoring and conservation AI.

Abstract

We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BioTrove will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BioTrove is publicly available, easily accessible, and ready for immediate use.
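The abstract notes that each image carries a scientific name, a taxonomic hierarchy, and a common name. As a minimal sketch of how such metadata could be combined into a single training caption, the example below joins these fields into one string. The `TaxonRecord` fields, the `build_caption` helper, and the caption template are all hypothetical choices made for illustration; they are not the authors' actual schema or caption format.

```python
from dataclasses import dataclass


@dataclass
class TaxonRecord:
    """Hypothetical per-image metadata; field names are illustrative only."""
    kingdom: str
    phylum: str
    class_: str
    order: str
    family: str
    genus: str
    species: str       # scientific (Latin binomial) name
    common_name: str   # English common name; may be empty


def build_caption(rec: TaxonRecord) -> str:
    """Join scientific name, taxonomic hierarchy, and common name into one caption.

    The template is an assumption for illustration, not the paper's format.
    """
    hierarchy = " ".join(
        [rec.kingdom, rec.phylum, rec.class_, rec.order, rec.family, rec.genus]
    )
    caption = f"a photo of {rec.species}, taxonomy: {hierarchy}"
    if rec.common_name:
        # Append the common name only when the record provides one.
        caption += f", commonly known as {rec.common_name}"
    return caption


monarch = TaxonRecord(
    kingdom="Animalia", phylum="Arthropoda", class_="Insecta",
    order="Lepidoptera", family="Nymphalidae", genus="Danaus",
    species="Danaus plexippus", common_name="monarch butterfly",
)
print(build_caption(monarch))
```

Keeping the Latin and common names in the same caption means a text encoder sees both vocabularies for the same image, which is the general idea behind pairing the two naming systems.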

Paper Structure (25 sections, 5 figures, 9 tables)

Introduction
The BioTrove Dataset
Characteristics.
Dual-language text descriptions.
Data Collection and Curation Methodology
Challenges with iNaturalist Open Data.
Curation of BioTrove.
Data Filtering and Preprocessing.
Models and Benchmarks
BioTrove-Train
New Benchmarks
BioTrove-CLIP: New vision-language foundation models for biodiversity
Experimental Results
Concluding Discussion
Appendix
...and 10 more sections

Figures (5)

Figure 1: Top Seven Phyla in the BioTrove Dataset. This figure displays the seven most frequently occurring phyla within BioTrove, which is curated to include data exclusively from the three primary kingdoms: Animalia, Plantae, and Fungi. For each phylum, the five most common species are shown, including their scientific names, common names, and the number of images per species. The phyla are ordered by species diversity, with the most diverse phylum on the right and the least diverse on the left.
Figure 2: Distribution of the BioTrove dataset. (a) Size of the top seven phyla in the BioTrove dataset. (b) Species counts for the top seven phyla. (c) The 40 most frequently occurring species in the entire BioTrove dataset.
Figure 3: Treemap diagram of the BioTrove dataset, starting from Kingdom. The nested boxes represent phyla, (taxonomic) classes, orders, and families. Box size represents the relative number of samples.
Figure 4: (a) Example images from BioTrove-Unseen. (b) BioTrove-Life-Stages with 20 class labels: four life stages (egg, larva, pupa, and adult) for five distinct insect species.
Figure 5: BioTrove-Train Dataset Analysis. (a) Consistent category distribution across BioTrove-Train and BioTrove-116M datasets. (b) Species exhibit a long-tailed distribution. (c) Impact of local vs. semi-global shuffling on species representation within training minibatches.
