Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?

Tian Liu; Anwesha Basu; James Caverlee; Shu Kong

TL;DR

This study evaluates Large Multimodal Models (LMMs) on Visual Species Recognition (VSR) and finds that they generally underperform well-tuned expert models trained with few-shot learning (FSL). The authors find, however, that LMMs can post-hoc correct an FSL expert's predictions when prompted with the top-k candidate species, their confidence scores, and few-shot visual examples. They propose Post-hoc Correction (POC), a training-free, plug-in prompting framework that re-ranks the expert's top-k predictions and improves on prior FSL methods by about 6.4% in average accuracy across five challenging VSR benchmarks, without extra training, validation, or manual intervention. POC generalizes across pretrained backbones and LMMs, offering a practical boost to existing FSL methods on domain-specific VSR tasks. The work points to a way of using LMMs for specialized biodiversity recognition without retraining.

Abstract

Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet species-level annotation demands domain expertise, so domain experts can realistically annotate only a few examples. Such limited labeled data motivates training an "expert" model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated strong performance on general recognition tasks. It is natural to ask whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models. Somewhat surprisingly, we find that LMMs struggle in this task, even with various established prompting techniques. LMMs significantly underperform even FSL expert models as simple as a pretrained visual encoder finetuned on the few-shot images. However, our in-depth analysis reveals that LMMs can effectively post-hoc correct the expert models' incorrect predictions. Briefly, given a test image, when prompted with the top predictions from an FSL expert model, LMMs can recover the ground-truth label. Building on this insight, we derive a simple method called Post-hoc Correction (POC), which prompts an LMM to re-rank the expert model's top predictions using enriched prompts that include softmax confidence scores and few-shot visual examples. Across five challenging VSR benchmarks, POC outperforms prior art of FSL by +6.4% in accuracy without extra training, validation, or manual intervention. Importantly, POC generalizes to different pretrained backbones and LMMs, serving as a plug-and-play module to significantly enhance existing FSL methods.
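The re-ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the fallback to the expert's top-1 prediction, and all names (`build_prompt`, `post_hoc_correct`, `ask_lmm`) are assumptions made for illustration. In particular, `ask_lmm` is a hypothetical callable that sends the prompt, along with the query image and few-shot reference images, to whatever LMM is in use and returns its text reply.

```python
from __future__ import annotations

import math
import re
from typing import Callable, Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over the expert model's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: Sequence[float], names: Sequence[str], k: int) -> list[tuple[str, float]]:
    """Return the k most confident (species, probability) pairs, best first."""
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def build_prompt(candidates: Sequence[tuple[str, float]]) -> str:
    """Assumed prompt layout: numbered candidates with softmax confidences."""
    lines = [
        "The query image shows one of the candidate species below.",
        "An expert classifier ranked them with these confidence scores:",
    ]
    for i, (name, prob) in enumerate(candidates, start=1):
        lines.append(f"{i}. {name} (confidence {prob:.2f})")
    lines.append("Reference images of each candidate are given in the same order.")
    lines.append("Answer with the number of the best-matching species.")
    return "\n".join(lines)


def post_hoc_correct(
    logits: Sequence[float],
    names: Sequence[str],
    ask_lmm: Callable[[str], str],  # hypothetical LMM client, supplied by the caller
    k: int = 5,
) -> str:
    """Let the LMM pick among the expert's top-k; keep the expert's top-1 otherwise."""
    candidates = top_k(softmax(logits), names, k)
    reply = ask_lmm(build_prompt(candidates))
    match = re.search(r"\d+", reply)
    if match:
        choice = int(match.group())
        if 1 <= choice <= len(candidates):
            return candidates[choice - 1][0]
    # Assumed fallback: an unparseable or out-of-range reply leaves the prediction unchanged.
    return candidates[0][0]
```

Because the LMM only chooses among candidates the expert already proposed, a sketch like this can change the final label only when the ground truth is within the expert's top-k, which matches the "re-rank" framing in the abstract.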
