FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology
Faizan Farooq Khan, Yousef Radwan, Eslam Abdelrahman, Abdulwahab Felemban, Aymen Mir, Nico K. Michiels, Andrew J. Temple, Michael L. Berumen, Mohamed Elhoseiny
TL;DR
This work shows that state-of-the-art multimodal large language models struggle with fine-grained marine species recognition, revealing gaps in both domain knowledge and visual discrimination. It introduces FishNet++, a large-scale multimodal benchmark with extensive textual descriptions, key-point annotations, and bounding boxes to diagnose and improve marine-domain understanding. Through targeted diagnostics, the paper disentangles errors due to domain knowledge, perception, and reasoning, and demonstrates that finetuning on FishNet++ (especially with explainable supervision) substantially boosts performance. The benchmark and findings provide a path toward domain-specific, language-informed models that can support marine biodiversity monitoring and conservation efforts.
Abstract
Multimodal large language models (MLLMs) have demonstrated impressive cross-domain capabilities, yet their proficiency in specialized scientific fields like marine biology remains underexplored. In this work, we systematically evaluate state-of-the-art MLLMs and reveal significant limitations in their ability to perform fine-grained recognition of fish species, with the best open-source models achieving less than 10% accuracy. This task is critical for monitoring marine ecosystems under anthropogenic pressure. To address this gap and investigate whether these failures stem from a lack of domain knowledge, we introduce FishNet++, a large-scale, multimodal benchmark. FishNet++ significantly extends existing resources with 35,133 textual descriptions for multimodal learning, 706,426 key-point annotations for morphological studies, and 119,399 bounding boxes for detection. By providing this comprehensive suite of annotations, our work facilitates the development and evaluation of specialized vision-language models capable of advancing aquatic science.
