Layer-wise Investigation of Large-Scale Self-Supervised Music Representation Models

Yizhi Zhou, Haina Zhu, Hangting Chen

TL;DR

This work investigates layer-wise information in large-scale self-supervised music models by analyzing MusicFM and MuQ across 14 downstream tasks using the MARBLE benchmark. It employs layer-wise embeddings and PWCCA to track how representations evolve from acoustic to semantic content, and compares layer-scanning versus all-layers approaches for downstream inputs. The findings show SSL representations outperform low-level features across most tasks, with optimal performance typically occurring in middle layers and a clear shift from acoustic to semantic information as depth increases. The study provides practical insights for layer selection and suggests directions for architectural and objective-focused improvements in music SSL models.
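The two ways of feeding layer outputs to a downstream probe can be sketched roughly as follows. This is a minimal illustration, not the paper's code. It assumes that "layer-scanning" means probing one layer at a time, and that "all-layers" means a softmax-weighted sum over layers (a common convention in SSL benchmarks, assumed here). The function names, the array layout `(layers, frames, dims)`, and the toy sizes are all hypothetical.

```python
import numpy as np

def pool_single_layer(hidden, layer):
    """Layer-scanning input: time-average the frames of one chosen layer.

    hidden: array of shape (n_layers, n_frames, dim) (assumed layout).
    """
    return hidden[layer].mean(axis=0)

def pool_all_layers(hidden, logits):
    """All-layers input: softmax-weighted sum over layers, then time-average.

    logits: array of shape (n_layers,); in practice these would be learned
    jointly with the downstream probe (an assumption for this sketch).
    """
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w = w / w.sum()
    mixed = np.tensordot(w, hidden, axes=(0, 0))  # (n_frames, dim)
    return mixed.mean(axis=0)

# Toy example with made-up sizes: 13 layers, 50 frames, 16 dimensions.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((13, 50, 16))

# Layer-scanning: one clip embedding per candidate layer.
per_layer = [pool_single_layer(hidden, i) for i in range(hidden.shape[0])]

# All-layers: uniform weights to start (zero logits).
combined = pool_all_layers(hidden, np.zeros(hidden.shape[0]))
```

With layer-scanning, a separate probe is trained on each entry of `per_layer` and the best layer is reported. With the all-layers variant, a single probe sees `combined` and the layer weights adapt during training.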

Abstract

Pre-trained models for music information retrieval based on self-supervised learning (SSL) have recently become popular and have shown success in various downstream tasks. However, there is limited research on what information these models encode and how applicable that information is. Exploring these aspects can help us better understand the models' capabilities and limitations, leading to more effective use in downstream tasks. In this study, we analyze the advanced music representation model MusicFM and the recently introduced SSL model MuQ. We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers. Through this analysis, we reveal insights into the structure and potential applications of SSL models in music information retrieval.

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Model framework, illustrating the paradigm of using SSL models for downstream tasks.
  • Figure 2: Layer-wise results for the MusicFM and MuQ models. Red numbers indicate the best-performing layers in each column for each task.
  • Figure 3: PWCCA scores between the intermediate layer representations and the model input representations.
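
For readers unfamiliar with the PWCCA score named in Figure 3, the following is a minimal NumPy sketch of projection-weighted CCA in its commonly described form: the canonical correlations between two representations are averaged with weights that reflect how much of the first representation each canonical direction accounts for. It is an illustration under stated assumptions, not the authors' implementation. The function name `pwcca`, the toy shapes, and the requirement that there are more samples than feature dimensions are all assumptions of this sketch.

```python
import numpy as np

def pwcca(x, y):
    """Projection-weighted CCA similarity between two representations.

    x: (n_samples, d_x), y: (n_samples, d_y), with n_samples larger than
    both d_x and d_y (assumed). Returns a score in [0, 1].
    """
    # Center each feature dimension.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Orthonormal bases for the column spaces of x and y.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)

    # Singular values of qx^T qy are the canonical correlations.
    u, rho, _ = np.linalg.svd(qx.T @ qy, full_matrices=False)
    rho = np.clip(rho, 0.0, 1.0)

    # Canonical variables of x, one per column: (n_samples, k).
    h = qx @ u

    # Weight each canonical direction by its total absolute projection
    # onto the original (centered) features of x, then normalize.
    alpha = np.abs(h.T @ x).sum(axis=1)
    alpha = alpha / alpha.sum()

    return float(alpha @ rho)

# Toy usage with random data standing in for layer activations.
rng = np.random.default_rng(0)
inputs = rng.standard_normal((200, 8))           # stand-in for input features
layer = inputs @ rng.standard_normal((8, 8))     # an invertible linear remix
unrelated = rng.standard_normal((200, 8))        # independent noise

score_same = pwcca(inputs, layer)       # close to 1.0: same subspace
score_noise = pwcca(inputs, unrelated)  # noticeably lower
```

Because CCA is invariant to invertible linear transforms, a layer that merely remixes its input scores near 1, while a layer whose content has drifted away from the input scores lower. Note that the weighting makes the measure asymmetric: `pwcca(x, y)` and `pwcca(y, x)` can differ.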