BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks

Samuel Stevens

TL;DR

ImageNet-1K top-1 accuracy no longer reliably predicts performance on scientific imagery. This motivates BioBench, a domain-grounded ecology benchmark with 9 tasks, 4 kingdoms, and 6 acquisition modalities (3.1M images). Using a minimal embedding API and linear probes, the study finds $R^2=0.34$ and $\rho=0.55$ between ImageNet accuracy and BioBench scores overall, and roughly $30\%$ of frontier models (those above $75\%$ ImageNet top-1 accuracy) are mis-ranked. Only a few generalist models (CLIP, SigLIP, SigLIP2) achieve new BioBench state-of-the-art scores, highlighting limited transfer from general benchmarks to ecological tasks. The work provides a practical, reproducible template for domain-specific benchmarks and demonstrates how application-driven evaluation can better guide AI progress in science.

Abstract

ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
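The abstract reports class-balanced macro-F1 rather than plain accuracy. The following is a minimal sketch of that metric in pure Python, not BioBench's own implementation: the function name `macro_f1` and the toy labels are illustrative assumptions, as is the choice to average over every class that appears in either the labels or the predictions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (hypothetical helper, not the BioBench API).

    Every class counts equally, no matter how many examples it has.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    # Assumption: average over classes seen in either labels or predictions.
    classes = set(y_true) | set(y_pred)
    per_class = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example (made-up labels): a classifier that always predicts the common class.
y_true = ["deer", "deer", "deer", "lynx"]
y_pred = ["deer", "deer", "deer", "deer"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.75, while macro-F1 is 3/7 (about 0.43) because the rare class contributes an F1 of zero. This is why a class-balanced metric is a natural fit for long-tailed ecological data.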

Paper Structure

This paper contains 5 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Predictive validity of ImageNet-1K accuracy across (a) species classification of plants [tan2019herbarium19], (b) species classification of animals in camera trap images [beery2020iwildcam, koh2021wilds] and (c) individual identification of beluga whales [algasov2024beluga, vcermak2024wildlifedatasets], measured with Spearman's rank correlation coefficient $\rho$ between ImageNet-1K and task rankings, computed across all checkpoints with ImageNet top-1 accuracy $\geq T\%$ (x-axis). Shaded region shows 95% bootstrapped confidence intervals. ImageNet-1K fails to predict model rankings on specific tasks as models improve.
  • Figure 2: Left (a-c): Random example images from ImageNet-1K, MSCOCO and ADE20K, three popular general-domain vision benchmarks [deng2009imagenet1k, lin2014mscoco, zhou2017ade20k]. Right (d-l): Random example images from each of the nine tasks in BioBench. Tasks in BioBench have radically different image distributions compared to general-domain vision benchmarks.
  • Figure 3: BioBench scores over time. The majority of new models fail to improve on BioBench.
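The quantity plotted in Figure 1, Spearman's $\rho$ restricted to checkpoints at or above an ImageNet threshold $T$, can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the paper's code: the function names, the minimum of three checkpoints, and all numbers are made up for the example, and the bootstrapped confidence intervals are omitted.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5


def rho_above_threshold(imagenet_acc, task_score, threshold):
    """Rho over checkpoints whose ImageNet top-1 accuracy is >= threshold."""
    kept = [(a, s) for a, s in zip(imagenet_acc, task_score) if a >= threshold]
    if len(kept) < 3:  # assumption: too few checkpoints to rank meaningfully
        return None
    xs, ys = zip(*kept)
    return spearman_rho(xs, ys)


# Made-up numbers for six hypothetical checkpoints (not results from the paper).
imagenet_acc = [60, 70, 78, 80, 85, 88]
task_score = [0.30, 0.40, 0.55, 0.50, 0.45, 0.52]
rho_all = rho_above_threshold(imagenet_acc, task_score, 0)
rho_frontier = rho_above_threshold(imagenet_acc, task_score, 75)
```

With these toy numbers the correlation is 0.6 across all six checkpoints and drops to -0.4 once only checkpoints at or above 75% are kept. This is the kind of degradation among strong models that Figure 1 tracks along the x-axis.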