LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

Pooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli

TL;DR

LiSTEN addresses the challenge of adapting LLMs to audio-language tasks without heavy reliance on ASR data by introducing dynamic prompt selection (DPS) over a pool of soft tokens. It combines Whisper and BEATs for audio encoding, a Q-Former that aligns audio features to the LLM, and a learnable prompt pool whose keys determine task-adaptive prompts; only the Q-Former and the prompt tokens are trained. Across the multitask (MT) evaluation tasks (ASR, En2Zh, ER, SV, SQA, ACAP), DPS, especially the similarity-based variant with stochastic training, outperforms LoRA and soft prompts, achieving competitive results with far fewer trainable parameters and enabling smaller inference prompts. The result is a data-efficient, scalable, and interpretable all-in-one audio-language model that can adapt rapidly to new tasks with less training data and lower computational cost.

Abstract

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.

Paper Structure

This paper contains 9 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: LiSTEN pipeline with Dynamic Prompt Selection (DPS). The prompt pool consists of key-value pairs. Audio is encoded using Whisper and BEATs, whose embeddings are concatenated and processed through a Q-Former. The task is represented as text and encoded with the backbone LLM tokenizer. The speech and text tokens are mean-pooled to obtain a query, which selects $k$ values from the prompt pool. These selected values serve as the soft prompt for the instance and are prepended to the input before being passed to the LLM.
  • Figure 2: Token usage distribution across tasks in the test set. The z-axis represents token frequency, the x-axis shows token indices sorted by frequency for ASR, and the y-axis indicates the task. A prompt pool of 400 tokens was used, with each instance selecting 10 tokens at inference.
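
The selection step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the speech and text tokens are mean-pooled jointly over one concatenated sequence, that keys are scored against the query by cosine similarity, and that keys, values, and tokens all share one embedding dimension. The function names (`mean_pool`, `cosine`, `select_prompts`, `build_llm_input`) are hypothetical, and the stochastic training variant mentioned in the TL;DR is not modeled.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average a sequence of token embeddings into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_prompts(
    speech_tokens: Sequence[Vector],
    text_tokens: Sequence[Vector],
    pool_keys: Sequence[Vector],
    pool_values: Sequence[Vector],
    k: int,
) -> Tuple[List[int], List[Vector]]:
    """Return the indices and values of the k best-matching pool entries."""
    # Assumption: one query per instance, pooled over speech and text together.
    query = mean_pool(list(speech_tokens) + list(text_tokens))
    # Assumption: similarity-based scoring of every key against the query.
    scores = [cosine(query, key) for key in pool_keys]
    # Highest score first; ties keep their original pool order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    return top, [pool_values[i] for i in top]


def build_llm_input(
    speech_tokens: Sequence[Vector],
    text_tokens: Sequence[Vector],
    pool_keys: Sequence[Vector],
    pool_values: Sequence[Vector],
    k: int,
) -> Tuple[List[int], List[Vector]]:
    """Prepend the selected soft-prompt values to the speech and text tokens."""
    top, prompt = select_prompts(
        speech_tokens, text_tokens, pool_keys, pool_values, k
    )
    return top, prompt + list(speech_tokens) + list(text_tokens)
```

With the configuration reported for Figure 2, the pool would hold 400 key-value pairs and `k` would be 10 at inference. The returned indices are the per-instance selections that a usage analysis such as Figure 2 would count for each task.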