DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

Jiawei Wang, Ming Lei, Yaning Yang, Xinyan Lin, Yuquan Le, Qiwei Ma, Zhiwei Xu, Zheqi Lv, Yuchen Ang, Zhe Quan, Tat-Seng Chua

Abstract

Identifying species among tens of thousands of visually similar taxa while also discovering unknown species in open-world environments remains a fundamental challenge in biodiversity research. Current methods treat identification and discovery as separate problems: classification models assume closed sets, and discovery relies on threshold-based rejection. Here we present DeepTaxon, a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable reasoning over retrieved visual evidence. Given a query image, DeepTaxon retrieves the top-$k$ candidate species, each with $n$ exemplar images, from a retrieval index and performs chain-of-thought comparative reasoning. Critically, we redefine discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem. A sample is novel if and only if the retrieval index lacks sufficient evidence for identification, so each retrieval naturally yields a classification or discovery label without manual annotation, thereby providing automatic supervision for both tasks. We train the framework via supervised fine-tuning on synthetic retrieval-augmented data, followed by reinforcement learning on hard samples, converting high-recall retrieval into high-precision decisions that scale to massive taxonomic vocabularies. Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets demonstrate consistent improvements in both identification and discovery. Ablation studies further reveal effective test-time scaling with candidate count $k$ and exemplar count $n$, strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders, establishing an interpretable solution for biodiversity research.
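The automatic supervision described in the abstract can be illustrated with a minimal sketch. It rests on one simplifying assumption that is not stated in the abstract: "sufficient evidence" is read as "the ground-truth species appears among the retrieved candidates". The names `Supervision` and `auto_label`, and the placeholder species labels, are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Supervision:
    """Training target derived from one retrieval result (hypothetical type)."""
    task: str               # "classification" or "discovery"
    target: Optional[str]   # species name for classification, None for discovery


def auto_label(true_species: str, candidates: Sequence[str]) -> Supervision:
    """Derive a label from the retrieved candidate species, with no manual annotation.

    Assumption: the index holds "sufficient evidence" exactly when the
    ground-truth species is among the retrieved candidates.
    """
    if true_species in candidates:
        return Supervision(task="classification", target=true_species)
    return Supervision(task="discovery", target=None)


# Placeholder species labels; a real index would return taxonomic names.
retrieved = ["species_a", "species_b", "species_c"]
in_index = auto_label("species_b", retrieved)
novel = auto_label("species_z", retrieved)
print(in_index)
print(novel)
```

Under this reading, removing a species from the index turns its query images into discovery examples, which is one way the same data could supervise both tasks.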

Paper Structure (30 sections, 4 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Pass@$k$ curves as a function of retrieved species count $k$ on iNaturalist-10K. Pass@$k$ measures whether the ground-truth species appears among the top-$k$ distinct species retrieved. The star markers denote the best classification accuracy of conventional top-1 retrieval and DeepTaxon, respectively, revealing a significant retrieval-decision gap that DeepTaxon substantially narrows. Detailed numerical values are provided in Appendix \ref{app:passk_data}.
  • Figure 2: Comparison of five paradigms for species identification and discovery, evaluated across five capabilities: (1) classification, (2) discovery, (3) open-set generalization, (4) interpretability, and (5) test-time scaling. Existing approaches achieve only a subset of these capabilities. DeepTaxon uniquely satisfies all five requirements through retrieval-augmented multimodal reasoning.
  • Figure 3: Overview of DeepTaxon. Given a query image, the retrieval module retrieves $k$ candidate species with $n$ exemplars each from the retrieval index. The reasoning module performs comparative analysis and outputs either a taxonomic classification or a discovery signal. Two-stage training with supervised fine-tuning and GRPO optimizes decision quality.
  • Figure 4: Cross-domain evaluation matrices (RQ5). Rows represent retrieval index domains, columns represent query domains. Diagonal (boxed) = classification accuracy, off-diagonal = discovery rate.
  • Figure 5: Bird case study: Qwen2.5-VL hallucinates habitat-based reasoning ("found in trees rather than on wires") and outputs Discovery. DeepTaxon grounds reasoning in visual features (yellow head, gray body, white collar) and correctly classifies as Ptilotula penicillata.
  • ...and 1 more figures
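
The Pass@$k$ definition given in the Figure 1 caption (whether the ground-truth species appears among the top-$k$ distinct species retrieved) can be sketched as follows. This is an illustrative reading only. It assumes the retriever returns a ranked list of exemplar hits, each tagged with a species label, and that duplicate species are collapsed in rank order. The function names and toy labels are hypothetical.

```python
from typing import List, Sequence


def top_k_distinct_species(ranked_species: Sequence[str], k: int) -> List[str]:
    """Collapse a ranked list of exemplar hits to the first k distinct species."""
    distinct: List[str] = []
    if k <= 0:
        return distinct
    for species in ranked_species:
        if species not in distinct:
            distinct.append(species)
            if len(distinct) == k:
                break
    return distinct


def pass_at_k(rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int) -> float:
    """Fraction of queries whose true species is among the top-k distinct species."""
    if len(rankings) != len(truths) or not truths:
        raise ValueError("rankings and truths must be non-empty and of equal length")
    hits = sum(
        truth in top_k_distinct_species(ranking, k)
        for ranking, truth in zip(rankings, truths)
    )
    return hits / len(truths)


# Toy data: three queries, each with four ranked exemplar hits.
rankings = [
    ["a", "a", "b", "c"],
    ["b", "b", "b", "a"],
    ["c", "a", "a", "b"],
]
truths = ["b", "a", "d"]  # "d" is absent from the third ranking

for k in (1, 2, 3):
    print(k, pass_at_k(rankings, truths, k))
```

Counting distinct species rather than raw hits matters when one species contributes several near-duplicate exemplars, since those would otherwise crowd other candidates out of the top $k$.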