MUOT_3M: A 3 Million Frame Multimodal Underwater Benchmark and the MUTrack Tracking Method

Ahsan Baidar Bakht; Mohamad Alansari; Muhayy Ud Din; Muzammal Naseer; Sajid Javed; Irfan Hussain; Jiri Matas; Arif Mahmood

TL;DR

This paper proposes MUOT_3M, a 3-million-frame pseudo-multimodal underwater tracking benchmark, and MUTrack, a SAM-based multimodal-to-unimodal tracker featuring visual-geometric alignment, vision-language fusion, and four-level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Together they establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.

Abstract

Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large-scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB-only, limiting robustness under severe color distortion, turbidity, and low-visibility conditions. We introduce MUOT_3M, the first pseudo-multimodal UOT benchmark, comprising 3 million frames from 3,030 videos (27.8 h) annotated with 32 tracking attributes, 677 fine-grained classes, and synchronized RGB, estimated enhanced RGB, estimated depth, and language modalities validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal-to-unimodal tracker featuring visual-geometric alignment, vision-language fusion, and four-level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.
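The abstract reports gains in AUC and precision without defining them in this section. As a rough illustration only, the sketch below assumes the OTB-style one-pass evaluation that is common in single-object tracking: success AUC averages the fraction of frames whose IoU exceeds each of a set of thresholds, and precision is the fraction of frames whose center error falls within 20 pixels. The function names, the `(x, y, w, h)` box format, the 21 thresholds, and the 20-pixel cutoff are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch, not the paper's evaluation code.
# Assumptions: boxes are (x, y, w, h); non-empty sequences; OTB-style
# conventions (21 IoU thresholds in [0, 1], 20-pixel precision cutoff).

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance between the centers of two boxes, in pixels."""
    acx, acy = a[0] + a[2] / 2, a[1] + a[3] / 2
    bcx, bcy = b[0] + b[2] / 2, b[1] + b[3] / 2
    return ((acx - bcx) ** 2 + (acy - bcy) ** 2) ** 0.5


def success_auc(preds, gts, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


def precision(preds, gts, pixel_threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= pixel_threshold for e in errors) / len(errors)
```

Under these assumptions, a benchmark-level score would typically be obtained by averaging the per-sequence values over all test videos.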

Paper Structure (22 sections, 3 equations, 17 figures, 7 tables)

Introduction
Related Work
Proposed MUOT-3M Dataset
Proposed MUTrack
Experiments
MUOT-3M Dataset Construction
Dataset Collection
YouTube Marine Videos
BiliBili Underwater Videos
Pexels and PixaBay Videos
Netflix and National Geographic Videos
Social Media Platform Videos
Dataset Curation
MUOT-3M Diversity
Dataset Total Cost
...and 7 more sections

Figures (17)

Figure 1: MUOT-3M dataset sample images. The language annotations are validated by an expert marine biologist.
Figure 2: MUOT-3M dataset diversity in terms of 16 Phylum categories, 124 families, and 677 fine-grained classes. 16 Phylum categories with corresponding representative families are shown. The distribution and labels of all classes are validated by the expert marine biologist. Non-marine species categories in MUOT-3M, i.e., human-related (diver, scuba) and non-biological (robot, ROVs), are not shown.
Figure 3: Performance degradation of SOTA trackers on WebUOT-1M [zhang2024webuot] and MUOT-3M.
Figure 4: MUOT-3M is much larger than existing UOT datasets.
Figure 5: MUTrack: Schematic of the proposed multimodal SAM-based tracking pipeline. Step 1 shows the pre-training process of visual-geometric and visual-textual alignments. Step 2 shows the proposed multimodal teacher tracker pre-trained on visual, geometric, and language cues, while Step 3 shows the proposed unimodal student tracker distilling knowledge from the multimodal teacher tracker.
...and 12 more figures
