Underwater Camouflaged Object Tracking Meets Vision-Language SAM2

Chunhui Zhang, Li Liu, Guanjie Huang, Zhipeng Zhang, Hao Wen, Xi Zhou, Shiming Ge, Yanfeng Wang

TL;DR

The paper tackles the challenge of underwater camouflaged object tracking by introducing UW-COT220, the first large-scale multi-modal underwater benchmark with bounding boxes, masks, and language descriptions, spanning 96 categories and ~159K frames. It proposes VL-SAM2, a vision-language tracker built on the SAM2 foundation that fuses image and language cues via a language branch and a Kalman-filter-based MATP to mitigate drift, evaluated in a zero-shot setting. Empirical results show VL-SAM2 achieves state-of-the-art performance on UW-COT220 and generalizes well to LaSOT and WebUOT-1M, with ablations confirming the benefits of the language prompt and MATP components. The work highlights the value of vision-language prompts and video foundation models for robust underwater tracking and sets a new benchmark for multi-modal underwater perception tasks.
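
The Kalman-filter idea behind drift mitigation can be illustrated with a small sketch. This summary does not describe MATP's internals, so the code below is not the authors' module. It is a generic constant-velocity Kalman filter over the box center, written under stated assumptions. The names `AxisKalman` and `predict_next_center`, the noise variances, and the initial velocity uncertainty are all hypothetical choices made for illustration.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate (position, velocity).

    Illustrative only: the noise settings below are assumed, not from the paper.
    """

    def __init__(self, position, process_var=0.01, measurement_var=1.0):
        self.p = float(position)  # estimated position
        self.v = 0.0              # estimated velocity (pixels per frame)
        # 2x2 covariance; the velocity starts out very uncertain.
        self.cov = [[measurement_var, 0.0], [0.0, 100.0]]
        self.q = process_var
        self.r = measurement_var

    def predict(self):
        """Advance the state by one frame and return the predicted position."""
        (a, b), (c, d) = self.cov
        self.p += self.v
        # cov <- F cov F^T + Q, with F = [[1, 1], [0, 1]] and Q = q * I
        self.cov = [[a + b + c + d + self.q, b + d], [c + d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with an observed position z."""
        (a, b), (c, d) = self.cov
        s = a + self.r           # innovation variance
        k0, k1 = a / s, c / s    # Kalman gain for position and velocity
        y = z - self.p           # innovation
        self.p += k0 * y
        self.v += k1 * y
        # cov <- (I - K H) cov, with H = [1, 0]
        self.cov = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


def predict_next_center(centers):
    """Filter a list of observed (cx, cy) centers and predict the next one."""
    kx = AxisKalman(centers[0][0])
    ky = AxisKalman(centers[0][1])
    for cx, cy in centers[1:]:
        kx.predict()
        ky.predict()
        kx.update(cx)
        ky.update(cy)
    return kx.predict(), ky.predict()
```

In a tracker, a prediction like this could be compared with the segmentation output for the next frame, so that a candidate far from the motion estimate is treated as likely drift. How VL-SAM2 actually combines the two is not stated in this summary.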

Abstract

Over the past decade, significant progress has been made in visual object tracking, largely due to the availability of large-scale datasets. However, these datasets have primarily focused on open-air scenarios and have largely overlooked underwater animal tracking, especially the complex challenges posed by camouflaged marine animals. To bridge this gap, we take a step forward by proposing the first large-scale multi-modal underwater camouflaged object tracking dataset, namely UW-COT220. Based on the proposed dataset, this work first comprehensively evaluates current advanced visual object tracking methods, including SAM- and SAM2-based trackers, in challenging underwater environments, e.g., coral reefs. Our findings highlight the improvements of SAM2 over SAM, demonstrating its enhanced ability to handle the complexities of underwater camouflaged objects. Furthermore, we propose a novel vision-language tracking framework called VL-SAM2, based on the video foundation model SAM2. Extensive experimental results demonstrate that the proposed VL-SAM2 achieves state-of-the-art performance across underwater and open-air object tracking datasets. The dataset and code are available at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.

Paper Structure (5 sections, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: A glance at some examples covering diverse underwater scenes in our UW-COT220 dataset. We provide extensive annotations for underwater camouflaged target objects in video sequences, including bounding boxes, masks, and language descriptions.
  • Figure 2: Overview of the proposed vision-language tracking framework VL-SAM2. Our VL-SAM2 retains the core architecture of SAM2 and freezes the weights of the image encoder. The language prompt is encoded by the language encoder and then injected into the mask decoder. Additionally, a training-free MATP module is integrated to enhance tracking robustness during inference.
  • Figure 3: Comparison of SOTA trackers on the UW-COT220 dataset using AUC, Pre, cAUC, and nPre scores. Best viewed by zooming in.
  • Figure 4: Ablation study on the UW-COT220 dataset.
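
The AUC and Pre scores named in Figure 3 are commonly computed from per-frame box overlap and center distance. The sketch below follows the widely used one-pass evaluation convention (success rate averaged over 21 overlap thresholds, precision at a 20-pixel center error). Whether UW-COT220 uses exactly these settings is an assumption, and cAUC and nPre are not covered. The function names and the `(x, y, w, h)` box format are illustrative choices.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(preds, gts, n_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


def precision(preds, gts, pixel_threshold=20.0):
    """Fraction of frames whose box-center error is within the threshold."""
    hits = 0
    for (px, py, pw, ph), (gx, gy, gw, gh) in zip(preds, gts):
        dx = (px + pw / 2) - (gx + gw / 2)
        dy = (py + ph / 2) - (gy + gh / 2)
        hits += math.hypot(dx, dy) <= pixel_threshold
    return hits / len(preds)
```

Because the success test is a strict `overlap > threshold`, even a perfect tracker scores 20/21 rather than 1.0 under this convention; the threshold at 1.0 can never be exceeded.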