CLAP-S: Support Set Based Adaptation for Downstream Fiber-optic Acoustic Recognition

Jingchen Sun, Shaobo Han, Wataru Kohno, Changyou Chen

TL;DR

This work adapts contrastive language-audio pretraining (CLAP) to fiber-optic acoustic recognition, where significant domain shifts and limited labeled data hinder zero-shot transfer. It proposes CLAP-S, a memory-augmented adaptation method that builds a Support Set from labeled data, retrieves explicit knowledge from it via cross-attention, and interpolates that with the implicit knowledge of a fine-tuned CLAP. Exploring text-aligned and task-aligned embeddings and two combination strategies, the authors show that CLAP-S, and especially CLAP-S$^{+}$, perform strongly on both lab-recorded fiber-optic ESC-50 variants and a real-world gunshot-firework dataset, with a clear efficiency trade-off: CLAP-S is training-free and fast, while CLAP-S$^{+}$ yields the highest accuracy. The work offers practical insight into balancing implicit and explicit knowledge for domain-shifted downstream tasks, and it releases code and a real-world dataset to support reproducibility, which may inform similar adaptations in other sensing domains. The core fusion mechanism is ${p}_{final}(y|x) = (1-\alpha)\,{p}_{clap}(y|x,u) + \alpha\,{p}_{support}(y|x,u)$, which combines memory retrieval with pre-trained knowledge to improve generalization.
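
The fusion rule quoted in the TL;DR can be illustrated with a minimal Python sketch. The function name `interpolate`, the three-class probability vectors, and the choice `alpha=0.5` are all assumptions made for illustration; they are not taken from the authors' released code.

```python
def interpolate(p_clap, p_support, alpha):
    """Return (1 - alpha) * p_clap + alpha * p_support, element-wise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(p_clap) != len(p_support):
        raise ValueError("both distributions must cover the same classes")
    return [(1.0 - alpha) * c + alpha * s for c, s in zip(p_clap, p_support)]


# Hypothetical class probabilities for a 3-class problem (illustrative values).
p_clap = [0.6, 0.3, 0.1]      # implicit knowledge: fine-tuned CLAP branch
p_support = [0.2, 0.7, 0.1]   # explicit knowledge: Support Set branch

p_final = interpolate(p_clap, p_support, alpha=0.5)

# The predicted class is the index with the highest fused probability.
predicted_class = max(range(len(p_final)), key=p_final.__getitem__)
```

Because the result is a convex combination of two probability distributions, it remains a valid distribution for any `alpha` in [0, 1]. Setting `alpha=0` falls back to the CLAP branch alone, and `alpha=1` relies entirely on the Support Set.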

Abstract

Contrastive Language-Audio Pretraining (CLAP) models have demonstrated unprecedented performance in various acoustic signal recognition tasks. Fiber-optic-based acoustic recognition is one of the most important downstream tasks and plays a significant role in environmental sensing. Adapting CLAP for fiber-optic acoustic recognition has become an active research area. Because optical fiber is a non-conventional acoustic sensor, fiber-optic acoustic recognition presents a challenging, domain-specific, low-shot deployment environment with significant domain shifts due to its unique frequency response and noise characteristics. To address these challenges, we propose a support-based adaptation method, CLAP-S, which linearly interpolates a CLAP Adapter with the Support Set, leveraging both implicit knowledge through fine-tuning and explicit knowledge retrieved from memory for cross-domain generalization. Experimental results show that our method delivers competitive performance on both laboratory-recorded fiber-optic ESC-50 datasets and a real-world fiber-optic gunshot-firework dataset. Our research also provides valuable insights for other downstream acoustic recognition tasks. The code and gunshot-firework dataset are available at https://github.com/Jingchensun/clap-s.

Paper Structure (9 sections, 3 equations, 2 figures, 6 tables)

This paper contains 9 sections, 3 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: The pipeline of our proposed method. A test sample is sent to the frozen pre-trained audio encoder and the fine-tuned adapter to obtain its embedding, which is then used to perform cross-attention with the keys from the support audio samples. The attention weights are then multiplied by the values of the support set to form the Explicit Knowledge. The final prediction is obtained by linearly interpolating the Explicit Knowledge with the Implicit Knowledge captured by the fine-tuned Adapter.
  • Figure 2: The Few-Shot Adaptation Results.
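
The retrieval step described in the Figure 1 caption can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes scaled dot-product scores followed by a softmax, and it assumes the support values are one-hot class labels. The names `support_probs` and `temperature`, and the toy two-dimensional embeddings, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def support_probs(query, keys, labels, num_classes, temperature=1.0):
    """Attend from the query to the support keys, then sum the attention
    weights per class (equivalent to weights times one-hot label values)."""
    weights = softmax([dot(query, k) / temperature for k in keys])
    probs = [0.0] * num_classes
    for w, label in zip(weights, labels):
        probs[label] += w
    return probs


# Toy Support Set: three stored embeddings (keys) with their class labels.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [0, 0, 1]

# Toy test embedding; it matches the third support sample most closely.
query = [0.0, 1.0]

p_support = support_probs(query, keys, labels, num_classes=2, temperature=0.1)
```

In this sketch the output is a class distribution that could play the role of the explicit-knowledge term in the interpolation. A lower `temperature` sharpens the attention toward the nearest support samples.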