Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech Data

Youngwon Choi, Jaeyoon Jung, Hyeonyu Kim, Huu-Kim Nguyen, Hwayeon Kim

TL;DR

This work tackles fine-tuning large audio language models for end-to-end spoken language understanding under limited speech data. It systematically compares text-only, direct mixing, and curriculum-learning fine-tuning schemes, showing that text-only baselines are already strong and that incorporating a small fraction of speech data yields substantial gains, particularly when data are scarce. Curriculum learning outperforms direct mixing at low-resource levels, while both schemes converge with more abundant speech data; in cross-lingual SLU, leveraging source-language speech with target-language text and minimal target speech enables effective adaptation across languages. The findings provide practical guidance for deploying LALMs in data-constrained SLU settings and underscore the potential of cross-lingual transfer when text resources are plentiful.

Abstract

Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks but remain underexplored for fine-tuning, especially with limited speech data. To bridge this gap, we systematically examine how different fine-tuning schemes, including text-only, direct mixing, and curriculum learning, affect spoken language understanding (SLU), focusing on scenarios where text-label pairs are abundant while paired speech-label data are limited. Results show that LALMs already achieve competitive performance with text-only fine-tuning, highlighting their strong generalization ability. Adding even small amounts of speech data (2-5%) yields substantial further gains, with curriculum learning particularly effective under scarce data. In cross-lingual SLU, combining source-language speech data with target-language text and minimal target-language speech data enables effective adaptation. Overall, this study provides practical insights into LALM fine-tuning under realistic data constraints.

Paper Structure

This paper contains 10 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Comparison of fine-tuning schemes for SLU. (a) shows the unified task/prompt format, (b) illustrates the text-only scheme, and (c) the direct mixing scheme. Curriculum learning applies (b) in the early epochs and (c) in the final epoch.
  • Figure 2: Zero-shot cross-lingual SLU from French to eleven target languages, reported as relative improvement in SLU-F1 over the text-only scheme.
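
The three schemes compared in Figure 1 differ mainly in which data each epoch sees. The following is a minimal sketch of that schedule, not the authors' implementation. The function name `build_epoch_data`, its parameters, and the choice to mix by simply concatenating and shuffling text and speech examples are all assumptions made for illustration; the only detail taken from the source is that curriculum learning uses text-only data in the early epochs and direct mixing in the final epoch.

```python
import random
from typing import List, Sequence

SCHEMES = {"text_only", "direct_mixing", "curriculum"}


def build_epoch_data(
    scheme: str,
    epoch: int,
    num_epochs: int,
    text_examples: Sequence[str],
    speech_examples: Sequence[str],
    seed: int = 0,
) -> List[str]:
    """Return the training examples for one epoch (hypothetical helper).

    text_only:     text-label pairs in every epoch.
    direct_mixing: text and speech examples together in every epoch.
    curriculum:    text-only in early epochs, direct mixing in the final epoch.
    """
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    if not 0 <= epoch < num_epochs:
        raise ValueError("epoch must satisfy 0 <= epoch < num_epochs")

    is_final_epoch = epoch == num_epochs - 1
    use_speech = scheme == "direct_mixing" or (
        scheme == "curriculum" and is_final_epoch
    )

    data = list(text_examples)
    if use_speech:
        # Assumption: mixing is plain concatenation; the source does not
        # specify sampling ratios or batching.
        data.extend(speech_examples)

    # Shuffle deterministically per epoch so runs are reproducible.
    random.Random(seed + epoch).shuffle(data)
    return data
```

For example, with three epochs, `build_epoch_data("curriculum", 0, 3, text, speech)` returns only the text examples, while the same call with `epoch=2` also includes the speech examples. In practice each example would carry its modality, prompt, and label rather than being a bare string.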