RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs

Logan Lawrence, Mustafa Chasmai, Rangel Daroya, Wuao Liu, Seoyun Jeong, Aaron Sun, Max Hamilton, Fabien Delattre, Oindrila Saha, Subhransu Maji, Grant Van Horn

Abstract

Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g. vocalization), or obscured due to occlusion, camera angle, or low resolution. Yet today's multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: "requires vocalization," "low quality image," or "view obstructed". For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) species identification on the answerable set is challenging for a variety of open-source and proprietary models (less than 13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily better calibrated to abstain on unanswerable examples, and (3) MLLMs generally fail to provide correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.

Paper Structure

This paper contains 36 sections, 25 figures, and 5 tables.

Figures (25)

  • Figure 1: Preview of RealBirdID. In contrast to previous species identification datasets, each genus in RealBirdID has a corresponding set of unanswerable (UA) examples. The proposed summary metric gauges both (1) a classifier's ability to disambiguate between confusing classes and (2) its ability to abstain from predicting on unanswerable examples. Incorrect abstention reasoning is shown in red, whereas correct reasoning is shown in green.
  • Figure 2: Peek into the dataset. A few examples of the answerable and unanswerable images in RealBirdID. Unanswerable examples are grouped by possible unanswerability reasons. Images in "Sound needed" may be harder to abstain on than those in "Quality", for example, as more detailed knowledge about the particular genus is needed to recognize these reasons.
  • Figure 3: Distribution of images across the answerable (A) and unanswerable (UA) subsets in RealBirdID. The genera within the unanswerable (UA) subset exhibit a highly imbalanced, long-tailed distribution. For example, 85% of UA images originate from just 61 genera that each contain more than five unanswerable samples. For details on the distribution of the 3,442 species, see the appendix.
  • Figure 4: Multiple choice question (MCQ) formatting is a problem for encoder models. A straightforward implementation of an abstention class with CLIP models is to simply expose the genus level as a text prompt and treat its prediction as abstention. However, this approach greatly underperforms a modification of a previous hierarchical method, TreeGT (Deng et al., 2012). "HM" refers to the harmonic mean of the "Answerable" and "Unanswerable" accuracies. Bird photographed by Tom Murray on iNaturalist (https://www.inaturalist.org/photos/155708903).
  • Figure 5: Visualization of sweeping parameters for classification and abstention metrics on popular CLIP-based models and MLLMs. To summarize how deep in the hierarchy classifiers can go while staying accurate, we use Information Gain vs. Accuracy (a). Each classifier admits a tradeoff curve when predicting species (b) and genus (d). To measure separation between the unanswerable and answerable sets, we measure the AUC of the model entropy (c).
  • ...and 20 more figures
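The "HM" summary metric mentioned in the Figure 4 caption can be sketched in a few lines. This is an illustrative implementation assuming the standard definition of the harmonic mean applied to the two per-subset accuracies; the function name and exact handling of zero accuracies are our assumptions, not the paper's released code.

```python
def harmonic_mean(answerable_acc: float, unanswerable_acc: float) -> float:
    """Harmonic mean (HM) of answerable and unanswerable accuracies.

    Returns 0.0 if either accuracy is 0, following the usual convention:
    a model that only answers (or only abstains) gets no credit.
    """
    if answerable_acc == 0 or unanswerable_acc == 0:
        return 0.0
    return 2 * answerable_acc * unanswerable_acc / (answerable_acc + unanswerable_acc)

# A model that always answers scores 0 despite high answerable accuracy,
# while a balanced model keeps a score close to its individual accuracies.
print(harmonic_mean(0.8, 0.0))
print(harmonic_mean(0.6, 0.6))
```

The harmonic mean is a natural choice here because it is dominated by the weaker of the two abilities, so a classifier cannot inflate its summary score by answering (or abstaining) indiscriminately.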