Music Foundation Model as Generic Booster for Music Downstream Tasks

WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kinwai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji

TL;DR

This work addresses the challenge of boosting diverse music downstream tasks with a single foundation model. It introduces SoniDo, a two-stage hierarchical model that combines a stage-1 HQ-VAE with stage-2 sparse transformers to produce transferable, multi-level tokens (SoniDo features) that serve as task-agnostic boosters. Experiments show that SoniDo features improve performance on music tagging, transcription, source separation, and mixing, both when training data are scarce and when the features are added to existing task-specific models. The approach offers a data-efficient, scalable path for integrating foundation-model representations into music processing pipelines, while highlighting the need for bias analysis in future work.

Abstract

We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.

Paper Structure

This paper contains 42 sections, 3 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: SoniDo extracts hierarchical features of target music samples, which are useful for solving music downstream tasks including understanding and generative tasks.
  • Figure 2: The two stages of SoniDo.
  • Figure 3: Stage-1 model comparison.
  • Figure 4: Attention-based feature aggregation and token-out data augmentation. "T", "M", and "B" denote the top, middle, and bottom priors, respectively. Token-out augmentation deletes the masked tokens from the input sequence. The attention block aggregates the sequence into a single vector and is followed by an MLP that predicts tags.
  • Figure 5: Model architectures of linear music transcription for piano.
  • ...and 9 more figures