FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology

Faizan Farooq Khan, Yousef Radwan, Eslam Abdelrahman, Abdulwahab Felemban, Aymen Mir, Nico K. Michiels, Andrew J. Temple, Michael L. Berumen, Mohamed Elhoseiny

TL;DR

This work shows that state-of-the-art multimodal large language models struggle with fine-grained marine species recognition, revealing gaps in both domain knowledge and visual discrimination. It introduces FishNet++, a large-scale multimodal benchmark with extensive textual descriptions, key-point annotations, and bounding boxes to diagnose and improve marine-domain understanding. Through targeted diagnostics, the paper disentangles errors due to domain knowledge, perception, and reasoning, and demonstrates that finetuning on FishNet++ (especially with explainable supervision) substantially boosts performance. The benchmark and findings provide a path toward domain-specific, language-informed models that can support marine biodiversity monitoring and conservation efforts.

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive cross-domain capabilities, yet their proficiency in specialized scientific fields like marine biology remains underexplored. In this work, we systematically evaluate state-of-the-art MLLMs and reveal significant limitations in their ability to perform fine-grained recognition of fish species, with the best open-source models achieving less than 10% accuracy. This task is critical for monitoring marine ecosystems under anthropogenic pressure. To address this gap and investigate whether these failures stem from a lack of domain knowledge, we introduce FishNet++, a large-scale, multimodal benchmark. FishNet++ significantly extends existing resources with 35,133 textual descriptions for multimodal learning, 706,426 key-point annotations for morphological studies, and 119,399 bounding boxes for detection. By providing this comprehensive suite of annotations, our work facilitates the development and evaluation of specialized vision-language models capable of advancing aquatic science.

Paper Structure

This paper contains 26 sections, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Examples of species descriptions summarized by GPT-4o using information scraped from credible sources, as described in the subsection on species descriptions.
  • Figure 2: Example images from FishNet++ showcasing part-level annotations. Each keypoint is color-coded by semantic part: eye (orange), fins (blue), mouth (magenta), body center (yellow), tail start (green), and tail apex (red). The number and placement of fins vary across species, and some species exhibit a forked tail apex. For each image, we also display the annotated bounding box.
  • Figure 3: Qualitative examples of species identification and reasoning generated by our finetuned Qwen-VL model when trained for explainability.
  • Figure 4: The same images as in Figure 2, with segmentation masks obtained from our automated pipeline using key-points as supervision.
  • Figure 5: We show the samples with erroneous segmentation masks obtained from our automated pipeline using key-points as supervision.
  • ...and 1 more figure