FishDetector-R1: Unified MLLM-Based Framework with Reinforcement Fine-Tuning for Weakly Supervised Fish Detection, Segmentation, and Counting

Yi Liu, Jingyu Song, Vedanth Kallakuri, Katherine A. Skinner

TL;DR

FishDetector-R1 is a unified, weakly supervised framework for underwater fish detection, segmentation, and counting. It couples a multimodal large language model (MLLM) with a segmentation foundation model (SAM 2) through a detect-to-count prompt, and adapts the MLLM by reinforcement fine-tuning with verifiable rewards (RLVR) using GRPO. The approach enforces spatial and numerical consistency between localization and counting, yields pixel-wise segmentation from sparse point annotations alone, and transfers zero-shot to SUIM. Key contributions are the detect-to-count prompt, the RLVR objective, and ablations showing that the reward signals are complementary. On DeepFish, the framework is competitive with, and in some settings superior to, fully supervised baselines, which makes it practical for scalable ecological monitoring and marine habitat assessment.

Abstract

Analyzing underwater fish imagery is critical for ecological monitoring but remains difficult due to visual degradation and costly annotations. We introduce FishDetector-R1, a unified MLLM-based framework for fish detection, segmentation, and counting under weak supervision. On the DeepFish dataset, our framework achieves substantial gains over baselines, improving AP by 20% and mIoU by 10%, while reducing MAE by 30% and GAME by 35%. These improvements stem from two key components: a novel detect-to-count prompt that enforces spatially consistent detections and counts, and Reinforcement Learning from Verifiable Reward (RLVR) with a complementary scalable paradigm leveraging sparse point labels. Ablation studies further validate the effectiveness of this reward design. Moreover, the improvement generalizes well to other underwater datasets, confirming strong cross-domain robustness. Overall, FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision. The project page for FishDetector-R1 is https://umfieldrobotics.github.io/FishDetector-R1.
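The two counting metrics named in the abstract can be illustrated with a short sketch. MAE is the mean absolute difference between predicted and true counts. GAME (Grid Average Mean absolute Error) at level L splits each image into a 2^L by 2^L grid and sums the absolute count error per cell, so it also penalizes objects counted in the wrong place. The function names, the use of point coordinates as object locations, and the choice of level below are illustrative assumptions; the grid level used in the paper is not stated in this summary.

```python
def mae(pred_counts, true_counts):
    """Mean absolute error between per-image predicted and true counts."""
    if len(pred_counts) != len(true_counts) or not pred_counts:
        raise ValueError("count lists must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(pred_counts, true_counts))
    return total / len(pred_counts)


def grid_counts(points, width, height, level):
    """Count (x, y) points in each cell of a 2^level x 2^level grid."""
    n = 2 ** level
    counts = [[0] * n for _ in range(n)]
    for x, y in points:
        # Clamp so points on the right/bottom border fall in the last cell.
        col = min(int(x / width * n), n - 1)
        row = min(int(y / height * n), n - 1)
        counts[row][col] += 1
    return counts


def game_single_image(pred_points, true_points, width, height, level):
    """GAME(level) for one image: sum of per-cell absolute count errors.

    The dataset-level GAME is the mean of this value over all images.
    """
    pred = grid_counts(pred_points, width, height, level)
    true = grid_counts(true_points, width, height, level)
    return sum(
        abs(p - t)
        for pred_row, true_row in zip(pred, true)
        for p, t in zip(pred_row, true_row)
    )
```

At level 0 the grid is a single cell, so GAME reduces to the absolute count error; higher levels expose predictions whose total is right but whose locations are wrong.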

Paper Structure

This paper contains 31 sections, 12 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Our proposed FishDetector-R1 aims to achieve AI-enabled fish image analysis with the guidance of sparse point labels and text prompts.
  • Figure 2: Overview of the proposed FishDetector-R1 framework. A two-stage detect-to-count pipeline integrates an MLLM with SAM 2 to jointly perform detection, segmentation, and counting. Reinforcement fine-tuning with GRPO and weak point-level supervision adapts the MLLM, ensuring consistency between detection and counting while enabling pixel-wise segmentation with only sparse labels.
  • Figure 3: Example Q&A pairs from FishDetector-R1 using our designed detect-to-count prompt.
  • Figure 4: Qualitative Comparison between Qwen2.5-VL and FishDetector-R1. On a challenging scene from DeepFish FishLoc, our detect-to-count strategy enables more accurate localization and structured outputs.
  • Figure 5: Detection and segmentation results of FishDetector-R1 across diverse underwater habitats. The model demonstrates robustness to variations in background complexity, lighting conditions, and water color, highlighting its applicability to real-world marine environments.
  • ...and 3 more figures
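
The Figure 2 caption mentions reinforcement fine-tuning with GRPO and a consistency requirement between detection and counting. A minimal sketch of the group-relative step in GRPO follows: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is normalized by the group mean and standard deviation. The reward function `toy_reward` is a hypothetical stand-in written for illustration only and is not the reward design of the paper; the use of the population standard deviation and the value of `eps` are also assumptions.

```python
from statistics import mean, pstdev


def toy_reward(boxes, stated_count, point_labels):
    """Hypothetical verifiable reward, NOT the paper's reward design.

    One point if the stated count matches the number of predicted boxes
    (detection/count consistency), one point if it matches the number of
    sparse point labels (count accuracy).
    """
    consistent = 1.0 if stated_count == len(boxes) else 0.0
    correct = 1.0 if stated_count == len(point_labels) else 0.0
    return consistent + correct


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one image with two point-labeled fish.
points = [(12, 40), (80, 55)]
responses = [
    ([(5, 30, 20, 50), (70, 45, 90, 65)], 2),  # consistent and correct
    ([(5, 30, 20, 50)], 1),                    # consistent, wrong count
    ([(5, 30, 20, 50)], 3),                    # inconsistent, wrong count
    ([(5, 30, 20, 50), (70, 45, 90, 65)], 3),  # inconsistent, wrong count
]
rewards = [toy_reward(boxes, count, points) for boxes, count in responses]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get positive advantages and are reinforced; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.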