SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking

Yunfeng Li, Bo Wang, Jiahao Wan, Xueyi Wu, Ye Li

TL;DR

This work introduces SonarT165, the first large-scale underwater acoustic object tracking benchmark, featuring 330 sequences (165 square, 165 fan) and 205K annotations to reflect acoustic imaging challenges. It also presents STFTrack, a specialized UAOT framework built on LiteTrack that incorporates a Frequency Enhancement Module, a Multi-view Template Fusion Module (MTFM), and an Optimal Trajectory Correction Module (OTCM) to address high noise, low texture, and trajectory drift in sonar imagery. Comprehensive experiments show STFTrack achieves state-of-the-art performance on SonarT165 among both general and lightweight trackers, with ablations validating the contribution of each module and the benefit of acoustic-specific image enhancement. The work provides a practical, scalable benchmark and a robust tracking pipeline that advances the deployment of acoustic vision in underwater observation systems. The public code at https://github.com/LiYunfengLYF/SonarT165 facilitates replication and further research.

Abstract

Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules: a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM integrates multi-view features from both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model's features, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show that the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/SonarT165.
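To make the idea of a normalized pixel brightness response score more concrete, the following is a minimal sketch and not the authors' OTCM. It assumes the score is the mean brightness inside a candidate box divided by the brightest pixel in the frame, and that the candidate with the higher score is kept. The function names `brightness_score` and `pick_box`, the `(x, y, w, h)` box layout, and the selection rule are all hypothetical choices made for illustration.

```python
import numpy as np


def brightness_score(image, box):
    """Mean brightness inside `box`, scaled to [0, 1] by the frame maximum.

    Assumed definition for illustration only; the paper's exact score may differ.
    `image` is a 2D grayscale array, `box` is (x, y, w, h) in pixels.
    """
    x, y, w, h = box
    h_img, w_img = image.shape
    # Clip the box to the image bounds.
    x0, y0 = max(0, int(x)), max(0, int(y))
    x1, y1 = min(w_img, int(x + w)), min(h_img, int(y + h))
    if x1 <= x0 or y1 <= y0:
        return 0.0  # box lies entirely outside the image
    peak = float(image.max())
    if peak == 0.0:
        return 0.0  # empty frame, no acoustic response anywhere
    return float(image[y0:y1, x0:x1].mean()) / peak


def pick_box(image, tracker_box, kalman_box):
    """Keep whichever candidate covers the stronger acoustic response (assumed rule)."""
    s_tracker = brightness_score(image, tracker_box)
    s_kalman = brightness_score(image, kalman_box)
    return tracker_box if s_tracker >= s_kalman else kalman_box
```

In this sketch, a Kalman prediction that has drifted onto dark background receives a low score and is rejected in favour of the box that still covers the bright target echo.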

Paper Structure

This paper contains 41 sections, 14 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: When underwater visibility is sufficient (panel (a)), the vehicle can use an underwater camera and a sonar system to jointly locate the tracked target, as in the RGB-Sonar tracking task (rgbs50). When underwater visibility is insufficient (panel (b)), the vehicle must rely on sonar alone to locate the target, which is the underwater acoustic object tracking (UAOT) task.
  • Figure 2: Main introduction of the proposed SonarT165 benchmark. (a) Data collection platform in the pool. (b) Sequence level proportion of different objects. (c) Sequence level statistics of different attributes. (d) Data collection platform in the field environment. (e) Frame level proportion of different objects. (f) Frame level statistics of different attributes.
  • Figure 3: Visualization of different attributes of the proposed SonarT165 benchmark. To show more intuitively the challenges they pose to the tracker, we show them in the search area. (a) Acoustic object crossover. (b) Similar object. (c) Out-of-view. (d) Small target. (e) Scale variant. (f) Appearance change. (g) Low acoustic reflection. (h) Target brightness change. (i) Background interference. (j) Field environment.
  • Figure 4: Visualization of bounding box distribution. (a) Distribution of the first-frame bounding box in the fan sequences. (b) Distribution of all bounding boxes in the fan sequences. (c) Distribution of the first-frame bounding box in the square sequences. (d) Distribution of all bounding boxes in the square sequences. (e) Square-root curve of the width and height of bounding boxes in the two types of sequences. (f) Width-height ratio curve of bounding boxes in the two types of sequences.
  • Figure 5: The overall framework of STFTrack. We take the SOT-pre-trained LiteTrack (litetrack) as the baseline. During the tracking phase, we first enhance the high-frequency information of the sonar image, then encode the image and feed it into the backbone. The search area features are passed through the frequency enhancement module and then into the prediction head to obtain the target state. Next, the predicted target state, the target history state, and the acoustic response map are fed into the trajectory correction module, which outputs the bounding box. Finally, we use the current-frame bounding box to obtain the dynamic template, and feed the fixed template and the dynamic template into the template fusion module to fuse the template features.
  • ...and 9 more figures
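
The Figure 5 caption mentions enhancing the high-frequency information of the sonar image before it reaches the backbone. As a rough illustration only, the sketch below boosts the high-frequency band of a grayscale frame in the Fourier domain. It is not the enhancement method used in STFTrack: the function name `enhance_high_freq`, the radial `cutoff`, the `gain`, and the 8-bit value range are all assumptions.

```python
import numpy as np


def enhance_high_freq(image, cutoff=0.1, gain=1.5):
    """Amplify spatial frequencies above `cutoff` (cycles per pixel) by `gain`.

    Hypothetical high-frequency emphasis filter, shown for illustration only.
    `image` is a 2D grayscale array with values in [0, 255].
    """
    img = image.astype(np.float64)
    spectrum = np.fft.fft2(img)
    # Frequency coordinate of every FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # Leave low frequencies untouched, scale everything above the cutoff.
    weights = np.where(radius > cutoff, gain, 1.0)
    enhanced = np.fft.ifft2(spectrum * weights).real
    return np.clip(enhanced, 0.0, 255.0)
```

With `gain=1.0` the filter is the identity, which gives a quick sanity check; larger gains sharpen edges and speckle alike, so in practice the cutoff would need tuning against the noise level of the sonar.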