Open-Insect: Benchmarking Open-Set Recognition of Novel Species in Biodiversity Monitoring

Yuyan Chen, Nico Lang, B. Christian Schmidt, Aditya Jain, Yves Basset, Sara Beery, Maxim Larrivée, David Rolnick

TL;DR

Open-Insect targets open-set recognition for novel species in biodiversity monitoring by introducing a large, fine-grained insect dataset spanning three geographic regions and multiple open-set configurations. It benchmarks 38 OSR methods across three categories (post-hoc, training-time regularization, and auxiliary-data-based) and finds simple post-hoc baselines like MSP to be competitive, while auxiliary data and careful pretraining further boost performance. The study demonstrates that local open-set species are more challenging due to taxonomic similarity, and that realistic auxiliary data improves discovery potential, including better generalization to possible undescribed species in the wild. These insights advance practical OSR methods for biodiversity monitoring and provide a framework for evaluating species discovery pipelines with regionally relevant data, while also addressing explainability and ethical considerations.
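
To make the post-hoc baseline concrete, here is a minimal sketch of MSP (maximum softmax probability) scoring, which operates on the logits of an already trained closed-set classifier. The function names, the example logits, and the threshold of 0.5 are illustrative assumptions, not values from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits: Sequence[float]) -> float:
    """Maximum softmax probability: higher means more 'known-looking'."""
    return max(softmax(logits))


def is_possibly_novel(logits: Sequence[float], threshold: float) -> bool:
    """Flag a sample as a candidate open-set example when MSP is low."""
    return msp_score(logits) < threshold


# Hypothetical logits over three closed-set species.
confident_logits = [8.0, 0.5, 0.2]   # one class clearly dominates
ambiguous_logits = [1.1, 1.0, 0.9]   # no class stands out

THRESHOLD = 0.5  # assumed value; in practice tuned on validation data

print(is_possibly_novel(confident_logits, THRESHOLD))
print(is_possibly_novel(ambiguous_logits, THRESHOLD))
```

Because the score only needs the classifier's outputs, no retraining is involved, which is what distinguishes post-hoc methods from the training-time regularization and auxiliary-data categories.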

Abstract

Global biodiversity is declining at an unprecedented rate, yet little information is known about most species and how their populations are changing. Indeed, some 90% of Earth's species are estimated to be completely unknown. Machine learning has recently emerged as a promising tool to facilitate long-term, large-scale biodiversity monitoring, including algorithms for fine-grained classification of species from images. However, such algorithms typically are not designed to detect examples from categories unseen during training -- the problem of open-set recognition (OSR) -- limiting their applicability for highly diverse, poorly studied taxa such as insects. To address this gap, we introduce Open-Insect, a large-scale, fine-grained dataset to evaluate unknown species detection across different geographic regions with varying difficulty. We benchmark 38 OSR algorithms across three categories: post-hoc, training-time regularization, and training with auxiliary data, finding that simple post-hoc approaches remain a strong baseline. We also demonstrate how to leverage auxiliary data to improve species discovery in regions with limited data. Our results provide insights to guide the development of computer vision methods for biodiversity monitoring and species discovery.

Paper Structure

This paper contains 33 sections, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Open-Insect benchmark results on three geographical regions with varying difficulty. The Open-Insect benchmark includes images of thousands of highly visually similar moth species, along with non-moth arthropods, divided by geographic region. Left: Results from 38 OSR methods on three open-set types: i) Local moth, ii) Non-local moth, and iii) Non-moth (see Table \ref{tab: main result}). Right: Visual dissimilarity across taxonomic levels: 1-hop (same genus), 2-hop (different genus, same family), 3-hop (different family within Lepidoptera), and non-moths (different order, $\geq$4 hops).
  • Figure 2: Open-Insect dataset overview. Regions A, B, and C correspond to closed-sets and local open-sets, while region D corresponds to the non-local open-set. (Region A: NE-America; B: W-Europe; C: C-America; D: Australia.) The tree maps visualize the taxonomic distribution of moth families in regions C and D, where nested boxes denote genera and species, and box size indicates the relative number of images. The same family is colored consistently across the three tree maps. Local open-set species are more similar to the closed-set than non-local ones. Biome codes and tree maps for the other two regions are provided in the Appendix.
  • Figure 3: Examples of likely novel species from BCI. These species show greater than 7% DNA barcode divergence from their closest match in the BOLD database, well above the 1.5% species-level cutoff.
  • Figure 4: TPR@5 (BCI) vs. AUROC (C-America O-L post-hoc methods). Overall, models that perform well on C-America O-L also tend to achieve higher performance on the BCI data.
  • Figure 5: Visualization of the Open-Insect taxonomic distribution. Tree maps (a)–(f) show the taxonomic composition of moth families in Open-Insect across three regions. Each nested box represents a genus or species, and box size reflects the relative number of images. The same family is colored consistently across regions. Local open-set species display taxonomic distributions more similar to their corresponding closed-set species, indicating shared families and comparable visual traits. In contrast, the non-local open-set samples from Australia (g) exhibit markedly different taxonomic and color patterns, reflecting greater divergence from the training regions.
  • ...and 4 more figures
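
The hop levels named in the Figure 1 caption can be read as a distance between two taxonomic lineages. The sketch below is one way to compute it, assuming each species is described by its order, family, genus, and species name. The `Taxon` type and `hop_distance` function are hypothetical helpers, not code from the paper, and the example species were chosen for illustration only. Since the caption gives "$\geq$4 hops" for a different order, the returned value 4 should be treated as a lower bound.

```python
from typing import NamedTuple


class Taxon(NamedTuple):
    """Simplified lineage; real taxonomies have more ranks."""
    order: str
    family: str
    genus: str
    species: str


def hop_distance(a: Taxon, b: Taxon) -> int:
    """Taxonomic hop level between two species, following the caption:
    1 = same genus, 2 = same family, 3 = same order, 4 = different order.
    Returns 0 when both lineages are identical."""
    if a == b:
        return 0
    if (a.order, a.family, a.genus) == (b.order, b.family, b.genus):
        return 1
    if (a.order, a.family) == (b.order, b.family):
        return 2
    if a.order == b.order:
        return 3
    return 4  # lower bound: the caption only states ">= 4 hops"


# Illustrative species (not taken from the Open-Insect dataset).
ipsilon = Taxon("Lepidoptera", "Noctuidae", "Agrotis", "Agrotis ipsilon")
segetum = Taxon("Lepidoptera", "Noctuidae", "Agrotis", "Agrotis segetum")
pronuba = Taxon("Lepidoptera", "Noctuidae", "Noctua", "Noctua pronuba")
luna = Taxon("Lepidoptera", "Saturniidae", "Actias", "Actias luna")
honeybee = Taxon("Hymenoptera", "Apidae", "Apis", "Apis mellifera")

for other in (segetum, pronuba, luna, honeybee):
    print(other.species, hop_distance(ipsilon, other))
```

Under this reading, a small hop distance corresponds to the visually similar, harder open-set cases, while non-moths sit at the far end of the scale.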