WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark

Chunhui Zhang, Li Liu, Guanjie Huang, Hao Wen, Xi Zhou, Yanfeng Wang

TL;DR

WebUOT-1M introduces the first million-scale underwater object tracking benchmark, addressing scale and diversity gaps in prior datasets by providing 1.1 million frames across 1,500 videos and 408 categories, with language prompts and 23 tracking attributes. To exploit cross-domain knowledge, the authors propose OKTrack, an omni-knowledge distillation framework that transfers open-air tracking knowledge from a multi-view teacher to a unimodal underwater student, augmented by a MATP module to mitigate drift. Evaluations on 30 trackers reveal Transformer-based and underwater-specific methods perform best, with substantial gains achieved through retraining and the proposed distillation approach, and highlight the dataset's potential to spur multimodal underwater tracking research. The work also demonstrates the utility of language prompts for vision-language tracking and provides extensive ablations, suggesting future expansion to more modalities and larger-scale underwater datasets.

Abstract

Underwater object tracking (UOT) is a foundational task for identifying and tracing submerged entities in underwater video sequences. However, current UOT datasets are limited in scale and in the diversity of target categories and scenarios they cover, which hinders the training and evaluation of modern tracking algorithms. To bridge this gap, we take the first step and introduce WebUOT-1M, i.e., the largest public UOT benchmark to date, sourced from complex and realistic underwater environments. It comprises 1.1 million frames across 1,500 video clips covering 408 target categories, far surpassing previous UOT datasets, e.g., UVOT400. Through meticulous manual annotation and verification, we provide high-quality bounding boxes for underwater targets. Additionally, WebUOT-1M includes language prompts for video sequences, expanding its application areas to, e.g., underwater vision-language tracking. Most existing trackers are tailored for open-air environments, so their performance degrades when they are applied to UOT because of the domain gap. Retraining and fine-tuning these trackers is challenging due to sample imbalance and the limited availability of real-world underwater datasets. To tackle these challenges, we propose a novel omni-knowledge distillation framework based on WebUOT-1M, incorporating various strategies to guide the learning of the student Transformer. To the best of our knowledge, this framework is the first to effectively transfer open-air domain knowledge to the UOT model through knowledge distillation, as demonstrated by results on both existing UOT datasets and the newly proposed WebUOT-1M. Furthermore, we comprehensively evaluate WebUOT-1M using 30 deep trackers, showcasing its value as a benchmark for UOT research by presenting new challenges and opportunities for future studies. The complete dataset, code, and tracking results will be made publicly available.
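
The abstract does not specify the distillation losses themselves. Purely as an illustration of the general teacher-student idea, here is a minimal sketch of classic temperature-scaled logit distillation. It is not the OKTrack losses; the function names, the example logits, and the temperature value are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Generic logit distillation, used here only as an illustration;
    the T^2 factor is the usual gradient-scale correction.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


# Hypothetical logits: a student that copies the teacher incurs no loss,
# while a student that ranks the candidates in reverse is penalized.
teacher = [2.0, 0.5, -1.0]
matched = kd_loss(teacher, teacher)
mismatched = kd_loss(teacher, [-1.0, 0.5, 2.0])
print(matched, mismatched)
```

In a setup like the one the abstract describes, the teacher would be a tracker trained on open-air data and the student the underwater model. The loss above only shows how a student can be pulled toward a teacher's output distribution.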

Paper Structure (45 sections, 1 equation, 18 figures, 13 tables, 1 algorithm)

This paper contains 45 sections, 1 equation, 18 figures, 13 tables, 1 algorithm.

Figures (18)

  • Figure 1: The proposed WebUOT-1M is much larger than existing UOT benchmarks [kezebou2019underwater, panetta2021comprehensive, alawode2022utb180, alawode2023improving].
  • Figure 2: A glance at some video sequences and annotations from the WebUOT-1M dataset. All sequences are divided into 12 superclasses: amphibian, arthropod, bird, chordate, coelenterate, crustacean, fish, mollusc, person, mammal (except humans), reptile, and inanimate object.
  • Figure 3: We propose a challenging benchmark containing diverse object classes shown in word clouds, and the number of videos in each class group forms a long-tail distribution.
  • Figure 4: Statistics of WebUOT-1M. (a) Abundant underwater scenarios. (b) Distribution of normalized target center position. (c) Distribution of video length.
  • Figure 5: OKTrack overview. During the training phase, we adopt four distillation losses (see Sec. \ref{sec:okd}). A training-free MATP module (see Sec. \ref{sec:network}) is used to improve tracking robustness at inference.
  • ...and 13 more figures