Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul

TL;DR

The paper investigates how to improve low-resource language support and instruction following in audio language models. It balances English and Thai training data through data mixtures and unifies audio understanding with speech instruction following in a single Typhoon-based architecture. The resulting model, Typhoon-Audio, combines Whisper-Th and BEATs encoders with a Q-Former adapter and the Typhoon LLM, and is trained on 1.82M pre-training examples and 0.64M SFT examples to perform ASR, translation, audio captioning, and speech instruction following in English and Thai. Existing open models show clear performance gaps on Thai, whereas Typhoon-Audio performs strongly in both Thai and English and is competitive with Gemini-1.5-Pro. Typhoon2-Audio improves further and hallucinates less, which highlights the importance of base LLM quality. Overall, targeted data mixtures and multi-task fine-tuning substantially improve instruction following in a low-resource language, allowing open-source audio language models to rival proprietary systems on a range of tasks. The study also identifies robustness bottlenecks such as background noise.

Abstract

Audio language models process audio inputs using textual prompts for tasks like speech recognition and audio captioning. Although built on multilingual pre-trained components, most are trained primarily on English, limiting their usability for other languages. This paper evaluates audio language models on Thai, a low-resource language, and finds that they lack emergent cross-lingual abilities despite their multilingual foundations. To address this, we explore data mixtures that optimize audio language models for both a target language and English while integrating audio comprehension and speech instruction-following into a unified model. Our experiments provide insights into improving instruction-following in low-resource languages by balancing language-specific and multilingual training data. The proposed model, Typhoon-Audio, significantly outperforms existing open-source models and achieves performance comparable to state-of-the-art Gemini-1.5-Pro in both English and Thai.
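To make the idea of a data mixture concrete, the following is a minimal sketch of weighted sampling over (task, language) buckets. The bucket names and weights are hypothetical placeholders chosen for illustration; they are not the proportions used in the paper.

```python
import random
from collections import Counter

# Hypothetical mixture weights (assumption, NOT the paper's actual proportions).
# Each key is a (task, language) bucket; each value is its sampling weight.
MIXTURE = {
    ("asr", "en"): 0.2,
    ("asr", "th"): 0.3,
    ("audio_caption", "en"): 0.1,
    ("audio_caption", "th"): 0.1,
    ("speech_instruction", "en"): 0.1,
    ("speech_instruction", "th"): 0.2,
}


def sample_buckets(mixture, n, seed=0):
    """Draw n (task, language) buckets with probability proportional to weight."""
    rng = random.Random(seed)
    keys = list(mixture)
    weights = [mixture[k] for k in keys]
    return rng.choices(keys, weights=weights, k=n)


def language_share(draws):
    """Return the fraction of draws that fall in each language."""
    counts = Counter(lang for _, lang in draws)
    total = sum(counts.values())
    return {lang: c / total for lang, c in counts.items()}


if __name__ == "__main__":
    draws = sample_buckets(MIXTURE, 10_000)
    print(language_share(draws))
```

Changing the weights shifts the balance between the target language and English, which is the kind of trade-off the abstract describes exploring.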

Paper Structure (10 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: The model architecture of Typhoon-Audio. The audio encoder consists of a Whisper encoder and a BEATs encoder. The adapter is based on a window-level Q-Former. The LLM is our Typhoon model.
  • Figure 2: Speech Instruction Following Data Creation Pipeline
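
The window-level Q-Former adapter named in Figure 1 can be illustrated with a small sketch of the bookkeeping involved. It assumes the common formulation in which encoder frames are split into fixed-size windows and each window is summarized by a fixed number of query tokens. The function names, window size, query count, and the choice to keep a shorter final window are all assumptions for illustration, not values or details taken from the paper.

```python
import math


def split_into_windows(frames, window_size):
    """Split a sequence of encoder frames into consecutive windows.

    The last window may be shorter than window_size (an assumption of this
    sketch; an implementation could pad it instead).
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [frames[i:i + window_size] for i in range(0, len(frames), window_size)]


def window_level_token_count(num_frames, window_size, queries_per_window):
    """Number of tokens passed to the LLM if each window yields a fixed
    number of query outputs."""
    if num_frames < 0 or window_size <= 0 or queries_per_window <= 0:
        raise ValueError("arguments must be positive (num_frames may be zero)")
    num_windows = math.ceil(num_frames / window_size)
    return num_windows * queries_per_window


if __name__ == "__main__":
    frames = list(range(101))  # stand-in for 101 encoder output frames
    windows = split_into_windows(frames, window_size=20)
    print(len(windows))
    print(window_level_token_count(len(frames), window_size=20, queries_per_window=2))
```

Under these assumptions the number of tokens the LLM sees grows with audio length rather than being fixed, which is the main point of applying the Q-Former per window instead of once over the whole sequence.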