Underwater Camouflaged Object Tracking Meets Vision-Language SAM2
Chunhui Zhang, Li Liu, Guanjie Huang, Zhipeng Zhang, Hao Wen, Xi Zhou, Shiming Ge, Yanfeng Wang
TL;DR
The paper tackles the challenge of underwater camouflaged object tracking by introducing UW-COT220, the first large-scale multi-modal underwater benchmark with bounding boxes, masks, and language descriptions, spanning 96 categories and ~159K frames. It proposes VL-SAM2, a vision-language tracker built on the SAM2 video foundation model that fuses image and language cues through a language branch and uses a Kalman-filter-based MATP module to mitigate drift; the tracker is evaluated in a zero-shot setting. Empirical results show that VL-SAM2 achieves state-of-the-art performance on UW-COT220 and generalizes well to LaSOT and WebUOT-1M, with ablations confirming the benefits of the language prompt and the MATP component. The work highlights the value of vision-language prompts and video foundation models for robust underwater tracking and sets a new benchmark for multi-modal underwater perception tasks.
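To make the idea of Kalman-filter-based drift mitigation concrete, the following is a minimal sketch of a generic constant-velocity Kalman filter over a box centre. It is not the authors' MATP implementation, whose details are not given in this summary. The class `ConstantVelocityKalman`, the helper `pick_candidate`, the noise parameters, and the rule of choosing the candidate nearest to the motion prediction are all assumptions made for illustration.

```python
import numpy as np


class ConstantVelocityKalman:
    """Generic Kalman filter over a box centre, state = [cx, cy, vx, vy].

    Hypothetical illustration only; not the paper's MATP module.
    """

    def __init__(self, cx, cy, process_var=1e-2, meas_var=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0], dtype=float)  # state estimate
        self.P = np.eye(4) * 10.0                            # state covariance
        # Constant-velocity transition with a time step of one frame.
        self.F = np.array([[1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        # Only the centre position is observed.
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_var                     # process noise (assumed)
        self.R = np.eye(2) * meas_var                        # measurement noise (assumed)

    def predict(self):
        """Propagate the state one frame ahead and return the predicted centre."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2].copy()

    def update(self, cx, cy):
        """Correct the state with an observed centre and return the filtered centre."""
        z = np.array([cx, cy], dtype=float)
        y = z - self.H @ self.x                              # innovation
        S = self.H @ self.P @ self.H.T + self.R              # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)             # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2].copy()


def pick_candidate(predicted_centre, candidate_centres):
    """Return the index of the candidate centre closest to the motion prediction."""
    candidates = np.asarray(candidate_centres, dtype=float)
    distances = np.linalg.norm(candidates - predicted_centre, axis=1)
    return int(np.argmin(distances))
```

In a tracker, a filter like this could be updated with the centre of each accepted box, and its prediction for the next frame could be used to prefer the candidate mask or box that agrees with the target's recent motion, which is one plausible way to resist drift toward similar-looking background regions.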
Abstract
Over the past decade, significant progress has been made in visual object tracking, largely due to the availability of large-scale datasets. However, these datasets have primarily focused on open-air scenarios and have largely overlooked underwater animal tracking, especially the complex challenges posed by camouflaged marine animals. To bridge this gap, we take a step forward by proposing the first large-scale multi-modal underwater camouflaged object tracking dataset, namely UW-COT220. Based on the proposed dataset, this work first comprehensively evaluates current advanced visual object tracking methods, including SAM- and SAM2-based trackers, in challenging underwater environments, e.g., coral reefs. Our findings highlight the improvements of SAM2 over SAM, demonstrating its enhanced ability to handle the complexities of underwater camouflaged objects. Furthermore, we propose a novel vision-language tracking framework called VL-SAM2, based on the video foundation model SAM2. Extensive experimental results demonstrate that the proposed VL-SAM2 achieves state-of-the-art performance across underwater and open-air object tracking datasets. The dataset and code are available at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.
