InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation

Chong Zhang, Yukun Ma, Qian Chen, Wen Wang, Shengkui Zhao, Zexu Pan, Hao Wang, Chongjia Ni, Trung Hieu Nguyen, Kun Zhou, Yidi Jiang, Chaohong Tan, Zhifu Gao, Zhihao Du, Bin Ma

TL;DR

InspireMusic tackles long-form, high-fidelity music generation by integrating an autoregressive transformer with a super-resolution flow-matching model. The system combines an efficient single-codebook audio tokenizer (WavTokenizer) with a HiFi-Codec-based upsampling pathway to produce 48 kHz audio from tokens of 24 kHz audio, enabling coherent outputs of up to 8 minutes. Built on the Qwen 2.5 LLM, the approach supports controllable generation from text or audio prompts and delivers competitive objective and subjective performance against state-of-the-art open-source systems such as MusicGen and Stable Audio 2.0. The work demonstrates a scalable, flexible pipeline that bridges language-driven generation and high-fidelity audio synthesis, and the code and models are publicly released.

Abstract

We introduce InspireMusic, a framework that integrates super resolution and a large language model for high-fidelity long-form music generation. This unified framework generates high-fidelity music, songs, and audio by combining an autoregressive transformer with a super-resolution flow-matching model. It enables controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches in that it uses an audio tokenizer with a single codebook that contains richer semantic information, which reduces training costs and improves efficiency. This combination enables high-quality audio generation with long-form coherence of up to $8$ minutes. An autoregressive transformer model based on Qwen 2.5 predicts audio tokens, and a super-resolution flow-matching model then generates high-sampling-rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model performs comparably to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
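
To give a rough sense of the scale implied by the figures above (24 kHz tokenizer input, 48 kHz output, up to 8 minutes), the following sketch works through the arithmetic. The token rate of 75 tokens per second is an assumption chosen for illustration and is not stated in this text; the class and function names are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    # ASSUMPTION: tokens per second of the single-codebook tokenizer.
    # This value is illustrative only; it is not given in the text above.
    token_rate_hz: float = 75.0
    # Sampling rates and maximum duration as stated in the TL;DR and abstract.
    input_sample_rate: int = 24_000
    output_sample_rate: int = 48_000
    max_duration_s: float = 8 * 60


def num_tokens(cfg: PipelineConfig, duration_s: float) -> int:
    """Number of tokens the autoregressive model must handle for a clip."""
    return round(cfg.token_rate_hz * duration_s)


def upsampling_factor(cfg: PipelineConfig) -> float:
    """Ratio between the output and the tokenizer input sampling rates."""
    return cfg.output_sample_rate / cfg.input_sample_rate


def output_samples(cfg: PipelineConfig, duration_s: float) -> int:
    """Number of waveform samples in the final high-sampling-rate audio."""
    return round(cfg.output_sample_rate * duration_s)


if __name__ == "__main__":
    cfg = PipelineConfig()
    print(num_tokens(cfg, cfg.max_duration_s))      # 36000 under the assumed rate
    print(upsampling_factor(cfg))                   # 2.0
    print(output_samples(cfg, cfg.max_duration_s))  # 23040000
```

Under the assumed token rate, an 8-minute piece corresponds to tens of thousands of tokens in a single stream, which is one way to see why a single-codebook tokenizer matters for training cost: each additional codebook would multiply the number of tokens to model.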

Paper Structure

This paper contains 27 sections, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Overview of the InspireMusic framework. InspireMusic is composed of audio tokenizers, an autoregressive transformer, and a super-resolution flow-matching model. (1) An audio waveform at a lower sampling rate is converted to discrete tokens by a high-bitrate-compression audio tokenizer. (2) The audio and text tokens are fed to an autoregressive model, which generates tokens by next-token prediction. (3) The flow-matching model then maps the generated tokens to latent features carrying high-resolution, fine-grained acoustic details, which are obtained via HiFi-Codec (yang2023hificodec) from audio at a higher sampling rate, so that acoustic information is passed between the models with high fidelity. (4) The vocoder decoder then produces high-quality 48 kHz audio.
  • Figure 2: The statistical distribution of music genres in the pre-trained dataset.
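
The four steps in the Figure 1 caption can be read as a simple data flow. The sketch below shows only the order in which the stages are chained; it is not the released implementation. Every name (`generate`, `tokenize`, `predict_next`, `tokens_to_latents`, `vocode`) is hypothetical, each stage is passed in as a plain callable, and passing only the newly generated tokens to the flow-matching stage is an assumption made for illustration.

```python
from typing import Callable, List, Sequence

Tokens = List[int]
Latents = List[List[float]]
Waveform = List[float]


def generate(
    text_tokens: Sequence[int],
    prompt_audio: Waveform,
    tokenize: Callable[[Waveform], Tokens],
    predict_next: Callable[[Sequence[int]], int],
    tokens_to_latents: Callable[[Tokens], Latents],
    vocode: Callable[[Latents], Waveform],
    num_new_tokens: int,
) -> Waveform:
    """Chain the four stages from the Figure 1 caption (hypothetical API)."""
    # (1) Lower-sampling-rate audio -> discrete tokens.
    audio_tokens = tokenize(prompt_audio)

    # (2) Autoregressive next-token prediction over text and audio tokens.
    context = list(text_tokens) + list(audio_tokens)
    generated: Tokens = []
    for _ in range(num_new_tokens):
        next_token = predict_next(context)
        context.append(next_token)
        generated.append(next_token)

    # (3) Generated tokens -> latent features with fine acoustic detail.
    # ASSUMPTION: only the newly generated tokens are passed on.
    latents = tokens_to_latents(generated)

    # (4) Vocoder decoder -> high-sampling-rate waveform.
    return vocode(latents)
```

In a real system each callable would wrap a trained model (the tokenizer, the Qwen 2.5-based transformer, the flow-matching model, and the vocoder); here they can be replaced with toy functions to check the wiring.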