Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?

Tian Liu, Anwesha Basu, James Caverlee, Shu Kong

TL;DR

This study evaluates Large Multimodal Models (LMMs) on Visual Species Recognition (VSR) and finds that they generally underperform well-tuned few-shot learning (FSL) expert models. The authors find, however, that LMMs can post-hoc correct an FSL expert's predictions when prompted with the top-k candidate species, their confidences, and few-shot visual examples. They propose Post-hoc Correction (POC), a training-free, plug-in prompting framework that re-ranks the expert’s top-k predictions and improves accuracy by about +6.4% on average across five challenging VSR benchmarks, without extra training, validation, or manual intervention. POC generalizes across backbones and LMMs, offering a practical, model-agnostic boost to existing FSL methods for domain-specific VSR tasks. The work points to a scalable way to leverage LMMs for specialized biodiversity recognition without costly retraining.

Abstract

Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet, species-level annotation demands domain expertise, making it realistic for domain experts to annotate only a few examples. These limited labeled data motivate training an "expert" model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. It is natural to ask whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models. Somewhat surprisingly, we find that LMMs struggle in this task, even with various established prompting techniques. LMMs even significantly underperform FSL expert models, which can be as simple as finetuning a pretrained visual encoder on the few-shot images. However, our in-depth analysis reveals that LMMs can effectively post-hoc correct the expert models' incorrect predictions. Briefly, given a test image, when prompted with the top predictions from an FSL expert model, LMMs can recover the ground-truth label. Building on this insight, we derive a simple method called Post-hoc Correction (POC), which prompts an LMM to re-rank the expert model's top predictions using enriched prompts that include softmax confidence scores and few-shot visual examples. Across five challenging VSR benchmarks, POC outperforms prior FSL methods by +6.4% in accuracy without extra training, validation, or manual intervention. Importantly, POC generalizes to different pretrained backbones and LMMs, serving as a plug-and-play module to significantly enhance existing FSL methods.
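
The re-ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the `query_lmm` callable (a stand-in for any LMM API that takes a prompt and returns text), the reference-image paths, and the fallback to the expert's top-1 prediction when the reply cannot be parsed are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the expert model's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits: Sequence[float], class_names: Sequence[str],
          k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most confident (species, softmax confidence) pairs."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(class_names[i], probs[i]) for i in order]


def build_prompt(candidates: Sequence[Tuple[str, float]],
                 few_shot_refs: Dict[str, List[str]]) -> str:
    """Assemble a text prompt; the wording here is hypothetical."""
    lines = ["The test image is attached. An expert model proposes these candidate species:"]
    for rank, (name, conf) in enumerate(candidates, start=1):
        refs = ", ".join(few_shot_refs.get(name, []))
        lines.append(f"{rank}. {name} (confidence {conf:.2f}); reference images: {refs}")
    lines.append("Re-rank the candidates from most to least likely, one name per line.")
    return "\n".join(lines)


def post_hoc_correct(logits: Sequence[float], class_names: Sequence[str],
                     few_shot_refs: Dict[str, List[str]],
                     query_lmm: Callable[[str], str], k: int = 5) -> str:
    """Ask the LMM to re-rank the expert's top-k and return the top-ranked species."""
    candidates = top_k(logits, class_names, k)
    reply = query_lmm(build_prompt(candidates, few_shot_refs))
    allowed = {name for name, _ in candidates}
    for line in reply.splitlines():
        name = line.strip()
        if name in allowed:
            return name  # first candidate named in the reply is treated as rank 1
    # Assumption: fall back to the expert's top-1 if the reply is unparseable.
    return candidates[0][0]
```

In a real pipeline, `query_lmm` would also send the test image and the few-shot reference images to the LMM; the sketch passes only text so that it stays self-contained.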

Paper Structure

This paper contains 18 sections, 20 figures, 10 tables.

Figures (20)

  • Figure 1: Overview of methods for Visual Species Recognition (VSR). We compare the results of (a) Large Multimodal Models (LMMs; e.g., Qwen-2.5-VL-7B-Instruct [qwen2.5-vl]) under various prompting strategies [kojima2022large, weng2023large], and (b) a few-shot learned (FSL) "expert" model obtained by finetuning the visual encoder of a Vision-Language Model (VLM; e.g., CLIP [radford2021learning]) on few-shot data [liu2025few]. Despite being pretrained on web-scale data, LMMs struggle in VSR and significantly underperform the FSL expert model. (c) However, we find that the correct label is often in the top-$k$ predictions of the expert model, and when prompted properly, the LMM can identify the correct one. Motivated by this, we propose Post-hoc Correction (POC), a simple plug-and-play method that harnesses LMMs to post-process expert models' predictions. Across five benchmarks, POC significantly improves existing FSL methods without extra training, validation, or manual intervention.
  • Figure 2: Examples of test images from five VSR benchmarks, along with an expert model's top-3 predicted species and softmax confidence scores. A reference image is provided for each predicted species. We train an expert model by finetuning the visual encoder of OpenCLIP ViT-B/32 [cherti2023reproducible] on 16-shot data following [liu2025few]. The prevalence of visually similar species among the top-3 predictions underscores the challenges of VSR. Notably, even when the top-1 predictions are incorrect (marked by red boxes), the top-3 often contain the correct species (marked by green boxes). Importantly, an LMM can identify the correct ones through a post-hoc process.
  • Figure 3: Top-$k$ accuracies of the FSL expert model. As expected, the top-5 accuracy is substantially higher than the top-1 accuracy, since top-$k$ accuracy can only increase with $k$. The large gap between the top-1 and top-5 metrics indicates that even when the expert model's top-1 prediction is incorrect, the correct label often appears among the top-5 predictions (see visual examples in \ref{fig:handpicked_examples}). This observation partially motivates our post-hoc correction method, which aims to find the correct species from the top-$k$ predictions.
  • Figure 4: Post-hoc Correction (POC) workflow. POC combines a few-shot learned expert model (e.g., one obtained by finetuning a VLM's visual encoder [liu2025few]) with an LMM for better VSR. Specifically, for a test image, the expert model predicts the top-$k$ species along with their corresponding softmax confidence scores. Then, POC constructs a few-shot in-context prompt [jiang2405many] by supplementing the test image with the top-$k$ species names, confidences, and few-shot examples. Based on the given context, the LMM is instructed to re-rank the top-$k$ species. Finally, the top-ranked species in its output is returned as the final prediction. We use $k=5$ in our study and compare different $k$ values in \ref{fig:ablate_topk}.
  • Figure 5: Comparison of mean accuracy averaged across five benchmarks using various pretrained backbones. Following [liu2025few], we train an expert model (termed "Few-shot FT") by finetuning different pretrained visual encoders on 16-shot labeled data sampled with three random seeds, and then run POC with the LMM Qwen-2.5-VL-7B-Instruct [qwen2.5-vl]. Results show that POC consistently improves the expert model [liu2025few] across backbones, with small standard deviations. The accuracy gains are larger for less powerful backbones, such as the ImageNet-pretrained ResNet-50 model [he2016deep]. It is important to note a data leakage issue: the pretraining data of the biological foundation model BioCLIP [stevens2024bioclip] contains iNaturalist [inat2021], from which our benchmarking datasets Aves, Insecta, Weeds, and Mollusca are partially sourced. This helps explain BioCLIP's strong performance and the diminishing gains of POC when using this backbone. Detailed performance of each backbone on each dataset is provided in Supplementary \ref{sec:detailed_results}.
  • ...and 15 more figures
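
The top-$k$ accuracy reported in the Figure 3 caption can be computed as in the following Python sketch. The function name `top_k_accuracy` and the toy scores are hypothetical and serve only to illustrate the metric; they are not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: three test images, three classes (made-up confidences).
toy_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
]
toy_labels = [0, 2, 1]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

On this toy data the top-1 prediction is wrong for two of the three images, yet the true label is always within the top 2. That is the kind of gap between top-1 and top-$k$ accuracy that a post-hoc re-ranking step can exploit.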