MiDashengLM: Efficient Audio Understanding with General Audio Captions

Heinrich Dinkel; Gang Li; Jizhong Liu; Jian Luan; Yadong Niu; Xingwei Sun; Tianzi Wang; Qiyang Xiao; Junbo Zhang; Jiahao Zhou

TL;DR

MiDashengLM addresses the need for open, general audio-language understanding by replacing ASR-centric pretraining with general audio captions via ACAVCaps and MECAT. The framework uses a Dasheng-based audio encoder aligned to text, followed by pretraining on public data and supervised fine-tuning with LoRA, which supports variable-length inputs and fast inference. The results show strong cross-domain performance: the model is competitive with baselines on audio captioning, MECAT benchmarks, and paralinguistic tasks, and it delivers substantial inference speedups. ASR performance lags behind closed models but remains robust across languages. The work emphasizes transparency and reproducibility through publicly available datasets and code.

Abstract

Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding, trained on general audio captions from our novel ACAVCaps training dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous works, which primarily focus on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions, fusing speech, sound, and music information into a single holistic textual representation of complex audio scenes. Lastly, MiDashengLM achieves up to a 4x speedup in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
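To make the two efficiency metrics in the abstract concrete, the following is a minimal sketch of how TTFT and throughput could be measured for a single streamed generation. It is an illustration only: `measure_generation`, `speedup`, and the token stream are hypothetical names, throughput is assumed here to mean generated tokens per second for one stream, and the measurement protocol actually used for the reported 4x and 20x figures is not described in this section.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_generation(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, n_tokens) for one token stream.

    `stream` is any iterable that yields tokens as the model produces them
    (a hypothetical stand-in for a streaming decoder).
    """
    start = clock()
    ttft = None
    n_tokens = 0
    for _ in stream:
        if ttft is None:
            # Time-to-first-token: delay between the request and the first token.
            ttft = clock() - start
        n_tokens += 1
    total = clock() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    # Assumed definition of throughput: tokens generated per second of wall time.
    throughput = n_tokens / total if total > 0 else float("inf")
    return ttft, throughput, n_tokens


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Ratio by which the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds
```

Under these assumptions, a model whose TTFT is 0.5 s against a 2.0 s baseline gives `speedup(2.0, 0.5) == 4.0`, which is how an "up to 4x" claim would read in this sketch.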
