LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Pooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli
TL;DR
LiSTEN addresses the challenge of adapting LLMs to audio-language tasks without heavy reliance on ASR data by introducing dynamic prompt selection (DPS) over a pool of soft tokens. It combines Whisper and BEATs for audio encoding, a Q-Former that aligns audio features to the LLM, and a learnable prompt pool whose keys determine which task-adaptive prompts are used; only the Q-Former and the prompt tokens are trained. In multitask evaluation (ASR, En2Zh, ER, SV, SQA, ACAP), DPS outperforms LoRA and soft prompts, most clearly in its similarity-based variant with stochastic training, and it achieves competitive results with far fewer trainable parameters while allowing smaller prompts at inference. The result is a data-efficient, scalable, and interpretable all-in-one audio-language model that can adapt rapidly to new tasks with less training data and lower computational cost.
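To make the idea of key-based prompt selection concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a query vector derived from the input, one key vector per prompt in the pool, and cosine similarity as the matching score. The function names (`cosine`, `select_prompts`), the toy vectors, and in particular the training-time sampling rule (sampling without replacement with weights proportional to the exponentiated similarity) are hypothetical choices made for illustration; the summary above only states that selection is similarity-based and that training is stochastic.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def select_prompts(query, keys, prompts, k, train=False, rng=None):
    """Pick k prompts from the pool by key similarity to the query.

    Assumption (not from the paper): in training mode, indices are sampled
    without replacement with weights exp(similarity); at inference the
    k most similar keys are taken deterministically. Requires k <= len(keys).
    """
    sims = [cosine(query, key) for key in keys]
    if train:
        rng = rng or random.Random()
        candidates = list(range(len(keys)))
        chosen = []
        for _ in range(k):
            weights = [math.exp(sims[i]) for i in candidates]
            pick = rng.choices(candidates, weights=weights)[0]
            chosen.append(pick)
            candidates.remove(pick)  # no prompt is selected twice
    else:
        ranked = sorted(range(len(keys)), key=lambda i: sims[i], reverse=True)
        chosen = ranked[:k]
    return chosen, [prompts[i] for i in chosen]


# Toy pool: three 2-d keys, each paired with a placeholder "prompt".
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = ["prompt_0", "prompt_1", "prompt_2"]
query = [1.0, 0.1]

indices, selected = select_prompts(query, keys, prompts, k=2)
print(indices, selected)  # [0, 2] ['prompt_0', 'prompt_2']
```

In a real model the prompts would be learnable embedding matrices prepended to the LLM input and the keys would be trained jointly with them; strings stand in for those tensors here so the selection logic stays visible.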
Abstract
Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
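The kind of prompt diversity and overlap analysis mentioned above can be sketched as follows. This is an illustrative assumption about how such an analysis might be set up, not the procedure used in the paper: it treats each task as a list of per-example prompt-index selections, measures diversity as the number of distinct prompts a task ever uses, and measures overlap as the Jaccard index between the prompt sets of two tasks. The helper names and the selection data are made up for the example and are not experimental results.

```python
from collections import Counter
from itertools import combinations


def prompt_usage(selections):
    """Count how often each prompt index was selected for one task."""
    return Counter(idx for chosen in selections for idx in chosen)


def jaccard(a, b):
    """Jaccard index of two collections of prompt indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def task_overlap(selections_by_task):
    """Pairwise Jaccard overlap of the prompts each task ever selects."""
    used = {task: set(prompt_usage(sel)) for task, sel in selections_by_task.items()}
    return {
        (t1, t2): jaccard(used[t1], used[t2])
        for t1, t2 in combinations(sorted(used), 2)
    }


# Hypothetical selections: each inner list holds the prompt indices
# chosen for one example of that task.
selections_by_task = {
    "ASR": [[0, 1], [0, 2]],
    "ER": [[2, 3], [3, 4]],
    "SV": [[0, 1], [1, 2]],
}

diversity = {task: len(prompt_usage(sel)) for task, sel in selections_by_task.items()}
print(diversity)  # {'ASR': 3, 'ER': 3, 'SV': 3}
print(task_overlap(selections_by_task))
# {('ASR', 'ER'): 0.2, ('ASR', 'SV'): 1.0, ('ER', 'SV'): 0.2}
```

Under this reading, a high overlap between two tasks suggests they draw on shared prompts, while a low overlap suggests the pool has specialized.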
