Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid

TL;DR

Falcon3-Audio presents a data-efficient, single-stage end-to-end audio-language system that aligns Whisper-derived audio features with instruction-tuned Falcon3 LLMs using a lightweight projection bridge and LoRA fine-tuning. Trained exclusively on public data (Open-ASQA and a synthetic voice-instruction set), it achieves competitive and, in some cases, state-of-the-art results on MMAU and AIR-Bench benchmarks while using far less data than contemporaries. The study systematically ablates design choices, showing that complex curricula, multiple encoders, or heavy data curation are not necessary for strong cross-domain audio understanding. This work demonstrates the practicality and reproducibility of open-resource ALMs, enabling scalable development of audio-language reasoning with transparent training and evaluation.

Abstract

Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored despite audio's centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data, fewer than 30K hours (5K unique), Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, scoring 64.14, on par with R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are not required for strong performance, even compared to models trained on over 500K hours of data.

Paper Structure

This paper contains 32 sections, 7 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Architecture of Falcon3-Audio, which uses a Whisper audio encoder to extract features and projects the resulting audio tokens into the Instruct LLM input space via a learnable projector.
  • Figure 2: Prompt template used for multi-modal instruction fine-tuning. The <|AUDIO|> token is replaced with audio features during fine-tuning.
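
The two captions above describe a projector that maps audio features into the LLM input space and an <|AUDIO|> placeholder that is replaced by those features. A minimal NumPy sketch of that data flow follows. It is illustrative only: the single linear projector, the toy dimensions, the example prompt tokens, and the helper names project_audio and splice_audio are assumptions made for this sketch, not the paper's implementation, and random arrays stand in for the Whisper encoder output and the LLM token embeddings.

```python
import numpy as np

AUDIO_PLACEHOLDER = "<|AUDIO|>"  # placeholder token named in the Figure 2 caption


def project_audio(features, weight, bias):
    """Map encoder features (T, d_audio) into the LLM embedding space (T, d_llm).

    Assumption: a single linear layer; the paper's projector may differ.
    """
    return features @ weight + bias


def splice_audio(tokens, token_embeddings, audio_embeddings):
    """Replace the single placeholder position with the projected audio embeddings."""
    positions = [i for i, tok in enumerate(tokens) if tok == AUDIO_PLACEHOLDER]
    if len(positions) != 1:
        raise ValueError("expected exactly one audio placeholder in the prompt")
    pos = positions[0]
    return np.concatenate(
        [token_embeddings[:pos], audio_embeddings, token_embeddings[pos + 1:]],
        axis=0,
    )


rng = np.random.default_rng(0)
d_audio, d_llm, n_frames = 8, 16, 5  # toy sizes, not the paper's dimensions

# Hypothetical prompt; the real template is the one shown in Figure 2.
tokens = ["Listen", ":", AUDIO_PLACEHOLDER, "What", "is", "heard", "?"]
token_embeddings = rng.normal(size=(len(tokens), d_llm))  # stand-in for LLM embeddings
audio_features = rng.normal(size=(n_frames, d_audio))     # stand-in for encoder output
weight = rng.normal(size=(d_audio, d_llm)) * 0.02         # learnable in a real model
bias = np.zeros(d_llm)

audio_embeddings = project_audio(audio_features, weight, bias)
llm_inputs = splice_audio(tokens, token_embeddings, audio_embeddings)
print(llm_inputs.shape)  # (11, 16): 6 text positions + 5 audio frames
```

In this sketch the sequence grows from 7 to 11 positions because one placeholder is swapped for five audio frames; the text embeddings on either side are left unchanged.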