Eureka-Audio: Triggering Audio Intelligence in Compact Language Models

Dan Zhang; Yishu Lei; Jing Hu; Shuwei He; Songhe Deng; Xianlong Luo; Danxiang Zhu; Shikun Feng; Rui Liu; Jingzhou He; Yu Sun; Hua Wu; Haifeng Wang

TL;DR

Eureka-Audio is a compact yet high-performance audio language model that is competitive with models 4 to 18 times larger across a broad range of audio understanding benchmarks. The paper also introduces DataFlux, a closed-loop audio instruction data synthesis and verification pipeline that constructs high-quality, logically consistent supervision from raw audio.

Abstract

We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning, matching or surpassing multiple 7B to 30B audio and omni-modal baselines. The model adopts a unified end-to-end architecture composed of a lightweight language backbone, a Whisper-based audio encoder, and a sparsely activated Mixture-of-Experts (MoE) adapter that explicitly accounts for audio heterogeneity and alleviates cross-modal optimization conflicts under limited capacity. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed-loop audio instruction data synthesis and verification pipeline that constructs high-quality, logically consistent supervision from raw audio. Extensive evaluations across ASR, knowledge reasoning, safety, instruction following, and paralinguistic benchmarks demonstrate that Eureka-Audio achieves an efficient balance between computational cost and performance. These results establish Eureka-Audio as a strong and practical baseline for lightweight audio understanding models.
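
To make the idea of a sparsely activated MoE adapter concrete, the following is a minimal sketch of top-k expert routing for a single audio frame. It is an illustration only, not the paper's implementation: the class name `SparseMoEAdapter`, the linear experts, the choice of `top_k=2`, the softmax renormalization over the selected experts, and all dimensions are assumptions that the abstract does not specify.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SparseMoEAdapter:
    """Toy adapter that routes each frame to its top-k experts (assumed design)."""

    def __init__(self, in_dim, out_dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.router = init(num_experts, in_dim)  # one scoring row per expert
        self.experts = [init(out_dim, in_dim) for _ in range(num_experts)]

    def route(self, frame):
        """Return (expert_index, weight) pairs for the top-k scoring experts."""
        scores = matvec(self.router, frame)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Renormalize gate weights over the selected experts only.
        weights = softmax([scores[i] for i in chosen])
        return list(zip(chosen, weights))

    def forward(self, frame):
        """Map one encoder frame to a (hypothetical) LM embedding vector."""
        output = [0.0] * len(self.experts[0])
        for index, weight in self.route(frame):
            expert_out = matvec(self.experts[index], frame)
            output = [o + weight * e for o, e in zip(output, expert_out)]
        return output


adapter = SparseMoEAdapter(in_dim=8, out_dim=4)
frame = [0.1 * i for i in range(8)]
routing = adapter.route(frame)
embedding = adapter.forward(frame)
```

Only the selected experts are evaluated for each frame, which is what keeps the per-frame cost below that of a dense mixture with the same number of experts.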

Paper Structure (33 sections, 4 equations, 4 figures, 7 tables)

Introduction
Related Work
Large Audio Language Model
Lightweight Multimodal Models
Architecture
Overview
Audio Encoder
Sparse MoE Adapter.
Language Model Backbone.
Sparse MoE Adapter
Training Objective.
Pretraining
Stage 1 (Alignment Stage).
Stage 2 (Joint Pretraining Stage).
Task Formulation
...and 18 more sections

Figures (4)

Figure 1: Comparison of Eureka-Audio with open-source audio-language and omni-modal baselines. (a) On the MMAU benchmark, Eureka-Audio (1.7B) achieves a score of 74.67, competitive with models 4--17$\times$ larger. (b) Eureka-Audio achieves the highest decode throughput of 269.7 tokens/sec among the compared models.
Figure 2: Overview of Eureka-Audio. Eureka-Audio adopts a unified end-to-end design consisting of three core components: (1) a Whisper-based audio encoder that encodes raw waveforms into high-temporal-resolution acoustic representations; (2) a sparse MoE adapter that maps acoustic features into the language model embedding space for efficient cross-modal alignment; and (3) a lightweight language model backbone (Qwen3-1.7B-base) that jointly models aligned audio embeddings and text tokens in an autoregressive manner to support diverse audio understanding tasks.
Figure 3: Overview of DataFlux. Starting from raw audio, DataFlux constructs high-quality paralinguistic instruction data through a three-step workflow: (1) Query--Choice Generation, where dense audio captions are first produced and then transformed into structured Query--Choice pairs using a predefined paralinguistic taxonomy and few-shot exemplars; (2) Answer Generation, where multiple audio large language models generate reasoning traces and answers conditioned on the same audio and queries; and (3) Answer Verification, where an automated judge evaluates multi-model outputs based on logical consistency and alignment with the audio content, retaining reliable samples while filtering noisy or inconsistent ones.
Figure 4: Decode throughput versus model size. Eureka-Audio-Instruct (1.7B) achieves the fastest inference at 269.7 tokens/sec, 3.7$\times$ faster than Qwen3-Omni-A3B while being 17$\times$ smaller, highlighting its lightweight and efficient design.
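
The three-step DataFlux workflow described in the Figure 3 caption can be sketched as a small pipeline. This is a hedged illustration under stated assumptions, not the authors' code: `run_dataflux`, `VerifiedSample`, and every stub function below are hypothetical names, the stubs stand in for real captioning and audio language models, and the unanimous-agreement judge is a simplified placeholder for the automated judge that the caption says checks logical consistency and alignment with the audio content.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VerifiedSample:
    audio_id: str
    query: str
    choices: List[str]
    answer: str


def run_dataflux(audio_id, caption_fn, query_fn, answer_fns, judge_fn) -> Optional[VerifiedSample]:
    """Run the three steps for one clip; return None if the sample is filtered out."""
    # Step 1: Query-Choice Generation (dense caption, then a structured query).
    caption = caption_fn(audio_id)
    query, choices = query_fn(caption)
    # Step 2: Answer Generation by several models on the same audio and query.
    candidates = [fn(audio_id, query, choices) for fn in answer_fns]
    # Step 3: Answer Verification by an automated judge.
    answer = judge_fn(caption, candidates)
    if answer is None:
        return None
    return VerifiedSample(audio_id, query, choices, answer)


# --- Hypothetical stubs standing in for real models (assumptions) ---
def stub_caption(audio_id):
    return "A woman speaks quickly in an angry tone."


def stub_query(caption):
    return "What emotion does the speaker convey?", ["angry", "calm", "sad"]


def model_a(audio_id, query, choices):
    return {"reasoning": "Fast, loud speech.", "answer": "angry"}


def model_b(audio_id, query, choices):
    return {"reasoning": "Harsh vocal tone.", "answer": "angry"}


def model_c(audio_id, query, choices):
    return {"reasoning": "Pace seems relaxed.", "answer": "calm"}


def agreement_judge(caption, candidates):
    """Placeholder judge: keep only unanimous answers that the caption supports."""
    answers = {candidate["answer"] for candidate in candidates}
    if len(answers) != 1:
        return None
    answer = answers.pop()
    return answer if answer in caption.lower() else None


kept = run_dataflux("clip-001", stub_caption, stub_query, [model_a, model_b], agreement_judge)
dropped = run_dataflux("clip-001", stub_caption, stub_query, [model_a, model_c], agreement_judge)
```

In this toy run, the sample is retained when the two models agree and filtered when they disagree, mirroring the caption's description of keeping reliable samples and discarding noisy or inconsistent ones.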
