SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking
Yunfeng Li, Bo Wang, Jiahao Wan, Xueyi Wu, Ye Li
TL;DR
This work introduces SonarT165, the first large-scale underwater acoustic object tracking benchmark, featuring 330 sequences (165 square, 165 fan) and 205K annotations to reflect acoustic imaging challenges. It also presents STFTrack, a specialized UAOT framework built on LiteTrack that incorporates a Frequency Enhancement Module, a Multi-view Template Fusion Module (MTFM), and an Optimal Trajectory Correction Module (OTCM) to address high noise, low texture, and trajectory drift in sonar imagery. Comprehensive experiments show STFTrack achieves state-of-the-art performance on SonarT165 among both general and lightweight trackers, with ablations validating the contribution of each module and the benefit of acoustic-specific image enhancement. The work provides a practical, scalable benchmark and a robust tracking pipeline that advances the deployment of acoustic vision in underwater observation systems. The public code at https://github.com/LiYunfengLYF/SonarT165 facilitates replication and further research.
Abstract
Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules: a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM integrates multi-view features from both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model's features, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show that the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/SonarT165.
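The abstract does not define the normalized pixel brightness response score used by the OTCM, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes that a brighter region in a sonar image corresponds to a stronger acoustic return, scores a candidate box by its mean pixel intensity scaled to [0, 1], and keeps whichever of two candidate boxes scores higher. The names `brightness_score` and `pick_box`, the `(x, y, width, height)` box layout, and the 8-bit intensity range are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels; assumed layout


def brightness_score(image: List[List[int]], box: Box, max_value: int = 255) -> float:
    """Mean pixel intensity inside `box`, scaled to [0, 1] (assumed score definition)."""
    x, y, w, h = box
    rows = len(image)
    cols = len(image[0]) if rows else 0
    # Clip the box to the image so partially out-of-frame boxes can still be scored.
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, cols), min(y + h, rows)
    if x1 <= x0 or y1 <= y0:
        return 0.0  # box lies entirely outside the image
    total = sum(sum(row[x0:x1]) for row in image[y0:y1])
    area = (x1 - x0) * (y1 - y0)
    return total / (area * max_value)


def pick_box(image: List[List[int]], tracker_box: Box, kalman_box: Box) -> Box:
    """Keep the candidate with the higher score; ties favour the tracker box."""
    if brightness_score(image, kalman_box) > brightness_score(image, tracker_box):
        return kalman_box
    return tracker_box


# Toy 6x6 frame: dim background (10) with a bright 2x2 target (200) at rows/cols 2..3.
frame = [[10] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        frame[r][c] = 200

tracker_box = (2, 2, 2, 2)  # covers the bright target
kalman_box = (0, 0, 2, 2)   # drifted prediction over background
chosen = pick_box(frame, tracker_box, kalman_box)
```

In this toy frame the drifted Kalman prediction covers only background, so its score is much lower than that of the tracker box and it is rejected. A real pipeline would need to handle noise and bright clutter, which a plain mean intensity does not.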
