VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

M. Maruf, Arka Daw, Kazi Sajeed Mehrab, Harish Babu Manogaran, Abhilash Neog, Medha Sawhney, Mridul Khurana, James P. Balhoff, Yasin Bakis, Bahadir Altintas, Matthew J. Thompson, Elizabeth G. Campolongo, Josef C. Uyeda, Hilmar Lapp, Henry L. Bart, Paula M. Mabee, Yu Su, Wei-Lun Chao, Charles Stewart, Tanya Berger-Wolf, Wasila Dahdul, Anuj Karpatne

TL;DR

The paper introduces VLM4Bio, a domain-specific benchmark for evaluating the zero-shot capabilities of pretrained vision-language models (VLMs) in organismal biology. The dataset comprises 469K question-answer pairs over 30K images of fishes, birds, and butterflies, spanning five tasks. The paper benchmarks 12 SOTA VLMs, examines prompting techniques and reasoning hallucination through dedicated tests, and compares open-ended and multiple-choice question formats. Key contributions include a large, multi-task, domain-focused dataset with curated trait matrices and ground-truth annotations, along with analyses of prompting strategies, the emergence of reasoning in large models (e.g., GPT-4V/4o), and the benefits of biologically fine-tuned baselines such as BioCLIP. The findings reveal substantial gaps in zero-shot trait discovery from biological images and point future work toward fine-tuning, retrieval-augmented approaches, and knowledge-infused prompting to better support biodiversity science.

Abstract

Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of 12 state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images. The code and datasets for running all the analyses reported in this paper can be found at https://github.com/sammarfy/VLM4Bio.

Paper Structure (32 sections, 37 figures, 7 tables)

Figures (37)

  • Figure 1: Overview of our goals and contributions. We analyze the capabilities of 12 state-of-the-art (SOTA) vision-language models (VLMs) in answering scientific questions using images from three groups of organisms: fishes, birds, and butterflies, over five groups of biologically relevant tasks. We also explore the effectiveness of these models for reasoning using various prompting techniques and tests for reasoning hallucination.
  • Figure 2: Illustrative examples of VLM4Bio tasks with different question-types.
  • Figure 3: Examples of correct and incorrect predictions of GPT-4V for trait identification, trait grounding, and trait-referring tasks related to the "eye". For visualization assistance, a red-colored bounding box is added around the "eye" in the image.
  • Figure 4: t-SNE plots to illustrate the effectiveness of random sampling with the majority species in the Fish-10K dataset. Randomly sampled images are shown as blue dots, while the remaining data points are represented by red dots. Subcaptions display the scientific names of the corresponding species. To generate the vector representation of the images, we leverage a VGG19 pretrained on the ImageNet dataset.
  • Figure 5: Dataset distribution of Fish-10K, Bird-10K, and Butterfly-10K.
  • ...and 32 more figures