CLAP-S: Support Set Based Adaptation for Downstream Fiber-optic Acoustic Recognition
Jingchen Sun, Shaobo Han, Wataru Kohno, Changyou Chen
TL;DR
This work addresses the challenge of adapting contrastive language-audio pretraining (CLAP) to fiber-optic acoustic recognition, where significant domain shifts and limited labeled data hinder zero-shot transfer. It proposes CLAP-S, a memory-augmented adaptation that builds a Support Set from labeled data, retrieves explicit knowledge from it via cross-attention, and interpolates that knowledge with the implicit knowledge of a fine-tuned CLAP. By exploring text-aligned and task-aligned embeddings and two combination strategies, the authors show that CLAP-S and especially CLAP-S$^{+}$ achieve strong performance on both lab-recorded fiber-optic ESC-50 variants and a real-world gunshot-firework dataset, with a clear efficiency trade-off: CLAP-S is training-free and fast, while CLAP-S$^{+}$ yields the highest accuracy. The approach offers practical insight into balancing implicit and explicit knowledge for domain-shifted downstream tasks, and the authors release their code and a real-world dataset to support reproducibility, which may inform similar adaptations in other sensing domains. The core fusion mechanism is $p_{final}(y|x) = (1-\alpha)\,p_{clap}(y|x,u) + \alpha\,p_{support}(y|x,u)$, which combines memory retrieval with pretrained knowledge to improve generalization.
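The interpolation formula above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the softmax attention over support embeddings, the one-hot label aggregation, and the `temperature` parameter are all assumptions made for illustration. Only the final mixing step follows the formula as stated.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def support_probs(query, support_keys, support_labels, num_classes, temperature=0.5):
    """Hypothetical stand-in for p_support(y|x): attend from the query
    embedding to the support embeddings, then sum the attention weight
    that falls on each class label."""
    weights = softmax([dot(query, k) / temperature for k in support_keys])
    probs = [0.0] * num_classes
    for w, label in zip(weights, support_labels):
        probs[label] += w
    return probs


def interpolate(p_clap, p_support, alpha):
    """p_final = (1 - alpha) * p_clap + alpha * p_support."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1 - alpha) * pc + alpha * ps for pc, ps in zip(p_clap, p_support)]


if __name__ == "__main__":
    # Toy values, chosen only to exercise the functions.
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
    labels = [0, 1, 1]
    p_support = support_probs(query, keys, labels, num_classes=2)
    p_clap = [0.4, 0.6]  # assumed output of the fine-tuned CLAP branch
    print(interpolate(p_clap, p_support, alpha=0.5))
```

Setting `alpha=0` recovers the CLAP branch alone and `alpha=1` the support-set branch alone, so in this sketch $\alpha$ directly controls how much weight the retrieved, explicit knowledge receives.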
Abstract
Contrastive Language-Audio Pretraining (CLAP) models have demonstrated unprecedented performance in various acoustic signal recognition tasks. Fiber-optic-based acoustic recognition is one of the most important downstream tasks and plays a significant role in environmental sensing. Adapting CLAP for fiber-optic acoustic recognition has become an active research area. Because optical fiber is a non-conventional acoustic sensor, fiber-optic acoustic recognition presents a challenging, domain-specific, low-shot deployment environment with significant domain shifts due to its unique frequency response and noise characteristics. To address these challenges, we propose a support-based adaptation method, CLAP-S, which linearly interpolates a CLAP Adapter with the Support Set, leveraging both implicit knowledge acquired through fine-tuning and explicit knowledge retrieved from memory for cross-domain generalization. Experimental results show that our method delivers competitive performance on both laboratory-recorded fiber-optic ESC-50 datasets and a real-world fiber-optic gunshot-firework dataset. Our research also provides valuable insights for other downstream acoustic recognition tasks. The code and gunshot-firework dataset are available at https://github.com/Jingchensun/clap-s.
