UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

TL;DR

The paper introduces LLM-Codec, a three-layer residual vector quantization (RVQ) codec that compresses audio into the token space of a frozen large language model (LLM), enabling cross-modal in-context learning for unseen audio tasks without parameter updates. By mapping audio to a lexical token sequence (approximately 57 tokens per second) and aligning the first layer semantically, the approach lets a frozen LLAMA-2 model, prompted with a few demonstrations, perform both audio understanding and generation tasks. Key contributions include the semantic-guided RVQ design, fixed codebooks initialized from the LLM vocabulary, semantic and consistency losses that stabilize training, and an open-source release of LLM-Codec. Overall, UniAudio 1.5 demonstrates feasible few-shot cross-modal capabilities on tasks such as speech emotion classification, sound event detection, and simple text-to-speech, highlighting a path toward universal audio foundation models with minimal fine-tuning.
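
As a rough illustration of the residual vector quantization idea mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes one frozen codebook per layer and greedy nearest-neighbour lookup; the toy codebooks, the function names (`nearest`, `rvq_encode`, `rvq_decode`), and the two-dimensional vectors are all hypothetical. In LLM-Codec the codebooks are initialized from the LLM vocabulary and kept fixed; here they are small hand-written lists standing in for those embeddings.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; return one code index per layer."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:  # codebooks are frozen: lookup only, no updates
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen code so the next layer quantizes what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codes across layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy three-layer setup (hypothetical values): coarse, medium, and fine codebooks.
toy_codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.1]],
]
codes = rvq_encode([11.0, 10.1], toy_codebooks)
print(codes)  # -> [1, 1, 1]
print(rvq_decode(codes, toy_codebooks))
```

Each index would correspond to a word or sub-word in the LLM vocabulary, which is what allows the resulting sequence to be placed directly in a text prompt.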

Abstract

Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation, but they cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach that enables frozen LLMs to perform multiple audio tasks in a few-shot manner without any parameter updates. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, which transfers the audio modality into the textual space, i.e., it represents audio tokens with words or sub-words from the LLM's vocabulary while maintaining high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing audio into the token space of a well-trained LLM. The audio representation can thus be viewed as a new foreign language, which LLMs can learn from a few demonstrations. In experiments, we evaluate the proposed approach on multiple audio understanding and generation tasks, e.g., speech emotion classification, audio classification, text-to-speech generation, and speech enhancement. The results show that an LLM equipped with the proposed LLM-Codec, named UniAudio 1.5 and prompted with only a few examples, can perform the expected functions in simple scenarios, which validates the feasibility and effectiveness of the proposed cross-modal in-context learning approach. To facilitate research on few-shot audio task learning and multi-modal LLMs, we have open-sourced the LLM-Codec model.

Paper Structure (31 sections, 6 equations, 6 figures, 8 tables)

This paper contains 31 sections, 6 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: The framework of the proposed approach (UniAudio 1.5), shown for speech emotion classification and simple text-to-speech generation. The data format is $\{x_1, y_1, x_2, y_2, \ldots, x_q\}$: the preceding pairs $\{x_i, y_i\}$ serve as demonstrations of the task, and the LLAMA model is asked to predict $y_q$, which can be either text or audio.
  • Figure 2: A high-level overview of LLM-Codec. Sub denotes the feature subtraction. We assume 3 RVQ layers are used in our study. In practice, we can use different RVQ layer settings.
  • Figure 3: Examples of simple text-to-speech generation using LLM-Codec and LLAMA2 model.
  • Figure 4: The token visualization with LLM-Codec. The audio samples are from the ESC50 dataset.
  • Figure 5: Examples of simple text-to-sound generation on FSDD dataset using LLM-Codec with a frozen LLAMA2 7B model.
  • ...and 1 more figure
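
The data format described in the Figure 1 caption, $\{x_1, y_1, x_2, y_2, \ldots, x_q\}$, can be sketched as a simple prompt builder. This is only an illustration under stated assumptions: the `Input:`/`Output:` template, the function name `build_few_shot_prompt`, and the example token strings are hypothetical and are not taken from the paper. The audio side is assumed to have already been converted by the codec into words or sub-words from the LLM vocabulary.

```python
def build_few_shot_prompt(demos, query, instruction=""):
    """Interleave (x_i, y_i) demonstrations, then append the query x_q.

    demos: list of (x, y) string pairs, e.g. (audio tokens as text, label).
    query: the final input x_q; the frozen LLM is expected to continue with y_q.
    """
    parts = [instruction] if instruction else []
    for x, y in demos:
        parts.append(f"Input: {x}\nOutput: {y}")
    # The prompt ends with an open "Output:" slot for the model to fill in.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical example: audio clips rendered as made-up vocabulary words,
# paired with emotion labels as demonstrations.
demos = [
    ("river stone lamp river", "happy"),
    ("cloud iron cloud dust", "sad"),
]
prompt = build_few_shot_prompt(
    demos,
    query="river lamp stone stone",
    instruction="Classify the emotion of each audio clip.",
)
print(prompt)
```

For a generation task the roles would be swapped: $x_i$ is text and $y_i$ is the audio token sequence, so the model's continuation would be decoded back to a waveform by the codec.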