SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

Chaoqun Liu, Mahani Aljunied, Guizhen Chen, Hou Pong Chan, Weiwen Xu, Yu Rong, Wenxuan Zhang

TL;DR

SeaLLMs-Audio introduces a large multilingual audio-language model tailored for Southeast Asia, extending support to Indonesian, Thai, Vietnamese, English, and Chinese, and trains on a comprehensive multimodal dataset. The approach combines a Qwen-based audio encoder with a multilingual LM, including a newly initialized audio adapter, and optimizes with the objective $P_{\theta}(x_t \mid x_{<t}, a)$ over a one-epoch run on substantial hardware. To enable standardized evaluation in the region, SeaBench-Audio provides a 14-task benchmark and an LLM-as-a-judge framework (Gemini) for scalable scoring across audio-centric tasks. The results show SeaLLMs-Audio achieving competitive performance and strong language quality on SEA tasks, signaling meaningful progress for regional LALMs, while recognizing limitations in language coverage and potential language mixing that future RLHF-based refinements could mitigate.
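
The objective named in the TL;DR is the conditional next-token probability. As a sketch only, assuming the standard autoregressive cross-entropy form (this summary does not reproduce the paper's own equation), and reading $a$ as the audio input, $x_t$ as the $t$-th target text token, $T$ as the target length, and $\theta$ as the trainable parameters (our interpretation of the symbols), the training loss would be:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_{<t}, a)
$$

Minimizing $\mathcal{L}(\theta)$ is equivalent to maximizing the likelihood of each target token given the preceding tokens and the audio.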

Abstract

We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages, namely Indonesian (id), Thai (th), and Vietnamese (vi), alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

Paper Structure

This paper contains 25 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of data curation process for SeaLLMs-Audio.
  • Figure 2: Training data distribution across (a) languages and (b) task types.
  • Figure 3: Architecture of SeaLLMs-Audio.
  • Figure 4: (a) Overview of the tasks in SeaBench-Audio; (b) evaluation pipeline with the LLM-as-a-judge framework.
  • Figure 5: Performance of the models on SeaBench-Audio as assessed by human evaluators: (a) average scores for overall performance, and (b) average scores for output language quality. Each response is evaluated on a 1–5 scale, with 5 indicating the highest quality. Human evaluations were performed blind, without disclosure of the generating model.
  • ...and 3 more figures