Mellow: a small audio language model for reasoning

Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj

TL;DR

This work addresses the challenge of enabling reasoning in small Audio-Language Models by introducing Mellow, a sub-billion-parameter ALM trained for reasoning over both audio and text. The authors create ReasonAQA, a reasoning-focused dataset that blends real audio-reasoning data with large-language-model-generated synthetic QA grounded in audio captions, and evaluate Mellow across understanding, deductive, and comparative tasks. They show that Mellow achieves state-of-the-art performance among small ALMs on MMAU and competes with significantly larger models while using far fewer parameters and less training data, highlighting the potential for edge-device audio reasoning. Ablation studies identify crucial factors—LM pretraining, projection layer design, synthetic data strategies, and audio encoder choices—that drive reasoning performance in small ALMs, offering a practical roadmap for efficient on-device audio-grounded reasoning systems.

Abstract

Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
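
The caption-to-question step described in the abstract can be pictured as simple prompt templating. The sketch below is illustrative only: the captions, the template wording, and the helper `build_prompts` are hypothetical and are not the authors' prompts or code. Only the two question types (detailed and multiple-choice) and the list of focus areas come from the abstract. The actual LLM call is left out, since the text above does not specify it.

```python
import random

# Made-up captions standing in for entries from an audio captioning dataset.
CAPTIONS = [
    "A dog barks while cars pass by on a wet road",
    "Rain falls steadily on a metal roof",
    "A crowd cheers and a whistle blows",
]

# Focus areas listed in the abstract.
FOCUS_AREAS = (
    "audio events, objects, acoustic scenes, signal properties, "
    "semantics, and listener emotions"
)

# Hypothetical template wording, for illustration only.
DETAIL_TEMPLATE = (
    "Caption: {caption}\n"
    "Write a question with a detailed answer about the audio, focusing on "
    + FOCUS_AREAS + "."
)
MCQ_TEMPLATE = (
    "Caption: {caption}\n"
    "Write a multiple-choice question with one correct option about the "
    "audio, focusing on " + FOCUS_AREAS + "."
)


def build_prompts(captions, num_samples, seed=0):
    """Sample captions and fill both templates for each sampled caption."""
    rng = random.Random(seed)
    sampled = rng.sample(captions, k=min(num_samples, len(captions)))
    prompts = []
    for caption in sampled:
        prompts.append({"type": "detail", "caption": caption,
                        "prompt": DETAIL_TEMPLATE.format(caption=caption)})
        prompts.append({"type": "mcq", "caption": caption,
                        "prompt": MCQ_TEMPLATE.format(caption=caption)})
    return prompts


prompts = build_prompts(CAPTIONS, num_samples=2)
for item in prompts:
    # In a real pipeline each prompt would be sent to an LLM here.
    print(item["type"], "|", item["caption"])
```

Each sampled caption yields one detailed prompt and one multiple-choice prompt, so the number of prompts is twice the number of sampled captions.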

Paper Structure

This paper contains 40 sections, 3 equations, 12 figures, 19 tables.

Figures (12)

  • Figure 1: The left plot shows the performance of different ALMs on MMAU against their parameter count. Only models whose parameter counts are known are plotted; (I) indicates Instruction-tuned and (C) indicates Chat. The right figure shows Mellow's capabilities with examples; the full examples are available in Figure \ref{fig:full examples}.
  • Figure 2: Mellow takes two audio recordings and a text prompt as input and generates a text output. The two audio inputs are encoded by an audio encoder and projected into the language model space using a mapper (map). Simultaneously, the text prompt is embedded by the text embedder. The audio projection 1, audio projection 2, and text embedding are concatenated to form the prefix. During concatenation, a separator token (s) is inserted between the three components (as shown in Fig. \ref{fig:mellow_arch}). The prefix is then used to prompt the small Language Model, which generates a natural language response.
  • Figure 3: Absolute difference in accuracy between performance with real audio input and performance with Gaussian noise.
  • Figure 4: The data generation pipeline for creating the training data of ReasonAQA consists of three main steps. First, audio captions are sampled from the AudioCaps and Clotho datasets. Next, these audio captions are inserted into detailed and multiple-choice (MCQ) templates to construct text prompts. Finally, these text prompts are used to query a large language model (LLM), which generates detailed and MCQ-based audio question-answer pairs.
  • Figure 5: LLM system prompt used to generate MCQ and descriptive questions for ReasonAQA. The "user prompt detail" and "user prompt mcq" show the prompts used to generate the detailed and MCQ audio question-answer pairs, respectively. In Ablation Table \ref{tab:appendix_ablation_study}, this is referred to as Type 1 data generation.
  • ...and 7 more figures
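
The prefix construction described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the dimensions are arbitrary, the mapper is reduced to a single random linear map, the separator is a single random vector, and all names (`project_audio`, `embed_text`, `build_prefix`) are hypothetical. Only the ordering of the pieces (audio projection 1, separator, audio projection 2, separator, text embedding) follows the caption.

```python
import random

random.seed(0)

AUDIO_DIM = 4    # hypothetical audio-encoder feature size
LM_DIM = 6       # hypothetical language-model embedding size
VOCAB_SIZE = 10  # hypothetical vocabulary size


def random_matrix(rows, cols):
    """Random stand-in for learned parameters or encoder outputs."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


MAPPER = random_matrix(AUDIO_DIM, LM_DIM)        # assumed linear mapper
TOKEN_TABLE = random_matrix(VOCAB_SIZE, LM_DIM)  # text embedding table
SEPARATOR = random_matrix(1, LM_DIM)[0]          # separator token (s)


def project_audio(frames):
    """Map each audio frame (AUDIO_DIM values) into the LM space (LM_DIM)."""
    return [
        [sum(frame[i] * MAPPER[i][j] for i in range(AUDIO_DIM))
         for j in range(LM_DIM)]
        for frame in frames
    ]


def embed_text(token_ids):
    """Look up one LM_DIM vector per prompt token."""
    return [TOKEN_TABLE[t] for t in token_ids]


def build_prefix(audio1, audio2, token_ids):
    """Concatenate both audio projections and the text embedding,
    with a separator between the three components."""
    return (project_audio(audio1) + [SEPARATOR]
            + project_audio(audio2) + [SEPARATOR]
            + embed_text(token_ids))


audio1 = random_matrix(5, AUDIO_DIM)  # 5 frames from the audio encoder
audio2 = random_matrix(3, AUDIO_DIM)  # 3 frames from the audio encoder
prefix = build_prefix(audio1, audio2, [1, 7, 4, 2])
print(len(prefix), len(prefix[0]))  # 14 6
```

The prefix length is the sum of the two audio sequence lengths, the prompt length, and the two separators (5 + 3 + 4 + 2 = 14 here); the language model would then be conditioned on this sequence to generate its response.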