WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark
Chunhui Zhang, Li Liu, Guanjie Huang, Hao Wen, Xi Zhou, Yanfeng Wang
TL;DR
WebUOT-1M introduces the first million-scale underwater object tracking benchmark, addressing scale and diversity gaps in prior datasets by providing 1.1 million frames across 1,500 videos and 408 categories, with language prompts and 23 tracking attributes. To exploit cross-domain knowledge, the authors propose OKTrack, an omni-knowledge distillation framework that transfers open-air tracking knowledge from a multi-view teacher to a unimodal underwater student, augmented by a MATP module to mitigate drift. Evaluations of 30 trackers show that Transformer-based and underwater-specific methods perform best, and that retraining and the proposed distillation approach yield substantial gains. The results also highlight the dataset's potential to spur multimodal underwater tracking research. The work further demonstrates the utility of language prompts for vision-language tracking and provides extensive ablations, suggesting future expansion to more modalities and larger-scale underwater datasets.
Abstract
Underwater object tracking (UOT) is a foundational task for identifying and tracing submerged entities in underwater video sequences. However, current UOT datasets are limited in scale and in the diversity of target categories and scenarios they cover, which hinders the training and evaluation of modern tracking algorithms. To bridge this gap, we take the first step and introduce WebUOT-1M, i.e., the largest public UOT benchmark to date, sourced from complex and realistic underwater environments. It comprises 1.1 million frames across 1,500 video clips spanning 408 target categories, largely surpassing previous UOT datasets, e.g., UVOT400. Through meticulous manual annotation and verification, we provide high-quality bounding boxes for underwater targets. Additionally, WebUOT-1M includes language prompts for video sequences, expanding its application areas, e.g., underwater vision-language tracking. Most existing trackers are tailored for open-air environments, leading to performance degradation when applied to UOT due to domain gaps. Retraining and fine-tuning these trackers are challenging due to sample imbalances and limited real-world underwater datasets. To tackle these challenges, we propose a novel omni-knowledge distillation framework based on WebUOT-1M, incorporating various strategies to guide the learning of the student Transformer. To the best of our knowledge, this framework is the first to effectively transfer open-air domain knowledge to the UOT model through knowledge distillation, as demonstrated by results on both existing UOT datasets and the newly proposed WebUOT-1M. Furthermore, we comprehensively evaluate WebUOT-1M using 30 deep trackers, showcasing its value as a benchmark for UOT research by presenting new challenges and opportunities for future studies. The complete dataset, code, and tracking results will be made publicly available.
