Underwater-Art: Expanding Information Perspectives With Text Templates For Underwater Acoustic Target Recognition

Yuan Xie, Jiawei Ren, Ji Xu

TL;DR

This paper addresses underwater acoustic target recognition under data scarcity and environmental variability by introducing UART, a tri-modal framework that learns from audio, spectrograms, and descriptive text templates. By converting contextual environmental information into natural language and applying contrastive learning across the three modalities, UART improves recognition and generalization, including in few-shot and incomplete-annotation settings. The method shows improvements on the ShipsEar and DeepShip datasets, and pre-training on richly annotated data yields useful priors for downstream tasks. The work uses language-based guidance to make better use of auxiliary information and to diversify representation learning, which supports more flexible and data-efficient recognition systems.

Abstract

Underwater acoustic target recognition is an intractable task due to the complex acoustic source characteristics and sound propagation patterns. Limited by insufficient data and narrow information perspective, recognition models based on deep learning seem far from satisfactory in practical underwater scenarios. Although underwater acoustic signals are severely influenced by distance, channel depth, or other factors, annotations of relevant information are often non-uniform, incomplete, and hard to use. In our work, we propose to implement Underwater Acoustic Recognition based on Templates made up of rich relevant information (hereinafter called "UART"). We design templates to integrate relevant information from different perspectives into descriptive natural language. UART adopts an audio-spectrogram-text tri-modal contrastive learning framework, which endows UART with the ability to guide the learning of acoustic representations by descriptive natural language. Our experiments reveal that UART has better recognition capability and generalization performance than traditional paradigms. Furthermore, the pre-trained UART model could provide superior prior knowledge for the recognition model in the scenario without any auxiliary annotation.

Paper Structure (21 sections, 3 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: The framework of UART. UART consists of three encoders. The three encoders respectively map their inputs to embeddings of the same dimension. All embeddings are distributed in a shared embedding space.
  • Figure 2: The process of encoder-based tuning. Encoder-based tuning only uses the audio encoder. The text encoder and spectrogram encoder are abandoned. A task-specific classifier needs to be added after the audio encoder.
  • Figure 3: The process of inference. The label set to be predicted forms a candidate queue. All candidate labels are filled into the text template, then $N$ candidate embeddings are obtained via the text encoder. The similarity between the audio embedding and each candidate text embedding is calculated, and the candidate with the highest similarity is the prediction.
  • Figure 4: Time-frequency spectrograms of three samples belonging to the type of motorboat. Their respective distance, channel depth, and wind level information are marked in the figure.
  • Figure 5: The confusion matrices for classification on ShipsEar (Dredger, Fishboat, Motorboat, Musselboat, Naturalnoise, Oceanliner, Passengers, RORO). The first matrix shows the classification results using auxiliary information, while the second shows the results using only labels. The depth of the color represents probability.
  • ...and 2 more figures
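
The inference procedure described in the Figure 3 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the template wording in `TEMPLATE`, the stand-in `toy_text_encoder`, and the `predict` helper are all hypothetical, and cosine similarity is assumed as the similarity measure because the caption does not name one. A real system would use the trained UART text and audio encoders in place of the toy encoder.

```python
import hashlib
import math
import random

# Hypothetical template wording; the paper's actual templates are not shown here.
TEMPLATE = "a recording of a {label}"


def toy_text_encoder(text, dim=16):
    """Stand-in for a trained text encoder (assumption, for illustration only).

    Maps each distinct string to a fixed pseudo-random vector.
    """
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict(audio_embedding, candidate_labels,
            text_encoder=toy_text_encoder, template=TEMPLATE):
    """Return the best-matching label and the similarity of every candidate."""
    # 1. Fill every candidate label into the text template.
    prompts = [template.format(label=label) for label in candidate_labels]
    # 2. Encode the N prompts into N candidate text embeddings.
    candidate_embeddings = [text_encoder(prompt) for prompt in prompts]
    # 3. Compare the audio embedding with each candidate embedding.
    similarities = [cosine_similarity(audio_embedding, emb)
                    for emb in candidate_embeddings]
    # 4. The candidate with the highest similarity is the prediction.
    best_index = max(range(len(candidate_labels)), key=similarities.__getitem__)
    return candidate_labels[best_index], similarities


# Usage with the ShipsEar class names listed in the Figure 5 caption.
labels = ["Dredger", "Fishboat", "Motorboat", "Musselboat",
          "Naturalnoise", "Oceanliner", "Passengers", "RORO"]

# Fake audio embedding: the "Motorboat" text embedding plus small noise,
# imitating an audio encoder that is well aligned with the text encoder.
noise = random.Random(0)
target = toy_text_encoder(TEMPLATE.format(label="Motorboat"))
audio_embedding = [v + 0.01 * noise.gauss(0.0, 1.0) for v in target]

prediction, scores = predict(audio_embedding, labels)
print(prediction)
```

Because the label set enters only through the template, the same routine can score a different candidate queue without retraining a classifier head, which is the property the caption highlights.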