MuCPT: Music-related Natural Language Model Continued Pretraining

Kai Tian, Yirong Mao, Wendong Bi, Hanjie Wang, Que Wenhui

TL;DR

MuCPT addresses the difficulty of building domain-specific language models for music by constructing a large, curated 40B-token music corpus (Matrix-music and WeChat-music) and a domain-first pretraining pipeline. A token-level soft scoring mechanism, driven by a Reference Model, normalizes per-token contributions to the loss so that noisy tokens are down-weighted and domain-relevant signals are preserved during continued pretraining. On MusicSimpleQA, MuCPT (32B) achieves 0.7759 accuracy, surpassing several larger or instruction-tuned baselines and indicating that task-aligned data and objectives can outperform sheer parameter count. The work delivers a scalable, auditable recipe for music-domain LLMs and introduces a reusable evaluation tool for quantifying factual music knowledge.

Abstract

Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large, music-related natural language corpus (40B tokens) that combines open-source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection and for dynamic down-weighting during optimization, reducing noisy gradients and amplifying task-aligned signals, thereby enabling more effective music-domain continued pretraining and alignment. To assess factuality, we design the MusicSimpleQA benchmark, which adopts short, single-answer prompts with automated agreement scoring. Beyond the benchmark design, we conduct systematic comparisons of data composition. Overall, this work improves both the corpus and the training objective, offering a scalable data-training framework and a reusable evaluation tool for building domain LLMs in the music field.
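
The token-level soft scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names, the choice of ratio (training-model loss divided by reference-model loss), the softmax-style normalization, and the `temperature` parameter are all hypothetical and are not taken from the paper. The idea shown is only the general one: tokens that the reference model also finds hard (ratio near 1 or below) are treated as likely noise and receive a smaller share of the loss, while the weights are normalized so that they average to 1.

```python
import math


def soft_token_weights(model_losses, ref_losses, temperature=1.0, eps=1e-8):
    """Hypothetical soft weights from a per-token loss ratio (illustrative only).

    model_losses: per-token losses of the model being trained.
    ref_losses:   per-token losses of the reference model on the same tokens.
    Returns one weight per token; the weights average to 1.
    """
    if len(model_losses) != len(ref_losses):
        raise ValueError("loss sequences must have the same length")
    if not model_losses:
        return []

    # Assumed criterion: ratio of training-model loss to reference-model loss.
    ratios = [m / (r + eps) for m, r in zip(model_losses, ref_losses)]

    # Softmax over the ratios (shifted by the max for numerical stability).
    peak = max(ratios)
    exps = [math.exp((x - peak) / temperature) for x in ratios]
    total = sum(exps)

    # Rescale so the mean weight is 1, keeping the loss scale comparable.
    n = len(ratios)
    return [n * e / total for e in exps]


def weighted_loss(model_losses, weights):
    """Mean of per-token losses after applying the soft weights."""
    return sum(w * l for w, l in zip(weights, model_losses)) / len(model_losses)


# Toy example: the middle token is hard for both models (a likely noisy token).
model_losses = [2.0, 6.0, 1.5]
ref_losses = [1.0, 6.0, 0.5]
weights = soft_token_weights(model_losses, ref_losses)
loss = weighted_loss(model_losses, weights)
```

In this toy example the middle token receives the smallest weight, so its gradient contribution shrinks without the token being dropped outright. A hard threshold on the same ratio could serve the data-selection use mentioned in the abstract, though the paper's actual criterion may differ.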

Paper Structure

This paper contains 16 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: A real example of a short article containing noisy tokens.
  • Figure 2: User preferences and corpus composition. (a) shows that most users prefer pop music, motivating our focus on music-entertainment continued pretraining; (b) summarizes the proportional makeup of the open-source music corpus mined for this work.