
Parameter-Efficient Transfer Learning for Music Foundation Models

Yiwei Ding, Alexander Lerch

TL;DR

This paper tackles the challenge of transferring music foundation models to downstream tasks without full fine-tuning. It introduces parameter-efficient transfer learning (PETL) with adapter-based, prompt-based, and reparameterization-based methods that train only a small fraction of the parameters. Evaluations on two music foundation models (MusicFM and MERT) across auto-tagging, key detection, and tempo estimation show that PETL consistently outperforms probing and often matches or approaches full fine-tuning, with substantial reductions in training cost. However, for key and tempo tasks, full fine-tuning sometimes remains superior, and the results raise questions about the usefulness of the current generation of music foundation models for those tasks. The work provides a practical, resource-efficient toolkit and releases its code.

Abstract

An increasing number of music foundation models have been released recently, promising a general, mostly task-independent encoding of musical information. Common ways of adapting music foundation models to downstream tasks are probing and fine-tuning. These common transfer learning approaches, however, face challenges. Probing might lead to suboptimal performance because the pre-trained weights are frozen, while fine-tuning is computationally expensive and prone to overfitting. Our work investigates the use of parameter-efficient transfer learning (PETL) for music foundation models, which integrates the advantages of probing and fine-tuning. We introduce three types of PETL methods: adapter-based methods, prompt-based methods, and reparameterization-based methods. These methods train only a small number of parameters and therefore do not require significant computational resources. Results show that PETL methods outperform both probing and fine-tuning on music auto-tagging. On key detection and tempo estimation, they achieve results similar to fine-tuning at a significantly lower training cost. However, the similar results achieved by training a small model from scratch call into question the usefulness of the current generation of foundation models for key and tempo tasks. Code is available at https://github.com/suncerock/peft-music/
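To make "a small number of parameters" concrete, the sketch below does a back-of-the-envelope count of trainable parameters for a hypothetical Transformer encoder (24 layers, model width 1024, feed-forward width 4096), ignoring the audio front-end and the task head. The layer sizes, applying LoRA only to the query and value projections, using one adapter per layer, and the rank and bottleneck values are all illustrative assumptions, not the configurations of MusicFM, MERT, or the paper's experiments.

```python
def encoder_layer_params(d: int) -> int:
    """Parameters in one hypothetical Transformer encoder layer with FFN width 4*d."""
    attention = 4 * (d * d + d)                    # Q, K, V and output projections, with biases
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)    # two linear layers, with biases
    layer_norms = 2 * (2 * d)                      # two LayerNorms, each with weight and bias
    return attention + ffn + layer_norms


def lora_params(d: int, rank: int) -> int:
    """LoRA on the Q and V projections (assumption): factors of shape (d, r) and (r, d) each."""
    return 2 * (d * rank + rank * d)


def adapter_params(d: int, bottleneck: int) -> int:
    """One bottleneck adapter per layer (assumption): d -> m and m -> d, with biases."""
    return (d * bottleneck + bottleneck) + (bottleneck * d + d)


def bitfit_params(d: int) -> int:
    """BitFit trains only bias terms: attention (4d), FFN (4d + d), LayerNorm (2d)."""
    return 4 * d + 5 * d + 2 * d


def trainable_summary(d: int = 1024, num_layers: int = 24,
                      rank: int = 8, bottleneck: int = 64) -> dict:
    """Trainable parameter counts per approach for the hypothetical encoder."""
    return {
        "full fine-tuning": num_layers * encoder_layer_params(d),
        "LoRA": num_layers * lora_params(d, rank),
        "adapter": num_layers * adapter_params(d, bottleneck),
        "BitFit": num_layers * bitfit_params(d),
    }


if __name__ == "__main__":
    counts = trainable_summary()
    total = counts["full fine-tuning"]
    for name, n in counts.items():
        print(f"{name:>16}: {n:>11,d} ({100 * n / total:.3f}%)")
```

Under these assumptions, LoRA trains about 0.26% of the encoder's parameters, the adapter about 1%, and BitFit under 0.1%, which is the kind of gap that makes PETL cheaper to store per task. Actual savings in training time also depend on where the modules sit, since gradients may still have to flow back through the frozen backbone.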

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of parameter-efficient transfer learning methods: (a) Adapter, (b) Prompt-tuning, (c) Prefix-tuning, (d) BitFit, (e) SSF and (f) LoRA.
  • Figure 2: Training and inference time of different methods relative to full-parameter fine-tuning. A value greater than 1 means the method is slower than full-parameter fine-tuning; a value less than 1 means it is faster. Note that the two plots use different scales.
  • Figure 3: Results on the MagnaTagATune dataset with the two foundation models and different approaches. Missing fine-tuning results indicate that full-parameter fine-tuning requires more than 24 GB of VRAM and could not be run on our GPU.
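The following is a minimal NumPy sketch of four of the methods illustrated in Figure 1 (Adapter, Prompt-tuning, SSF and LoRA), written from the standard formulations of these techniques rather than from the authors' released code. The function names, the row-vector convention (inputs of shape (tokens, features)), and the zero initializations are assumptions for illustration. Prefix-tuning, which prepends learnable key/value vectors inside attention, is omitted, as is BitFit, which only unfreezes existing bias terms and so changes no forward computation.

```python
import numpy as np


def adapter(h, w_down, b_down, w_up, b_up):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    z = np.maximum(h @ w_down + b_down, 0.0)
    return h + z @ w_up + b_up


def prepend_prompts(x, prompts):
    """Prompt-tuning: prepend learnable prompt vectors to the input token sequence.
    x: (seq_len, d), prompts: (num_prompts, d) -> (num_prompts + seq_len, d)."""
    return np.concatenate([prompts, x], axis=0)


def ssf(x, gamma, beta):
    """SSF: scale and shift features with learnable per-channel gamma and beta."""
    return x * gamma + beta


def lora_linear(x, w0, a, b, alpha):
    """LoRA: frozen weight w0 plus a scaled low-rank update (alpha / r) * a @ b.
    x: (n, d_in), w0: (d_in, d_out), a: (d_in, r), b: (r, d_out)."""
    rank = a.shape[1]
    return x @ w0 + (alpha / rank) * (x @ a @ b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, r, m = 16, 4, 8
    x = rng.standard_normal((10, d))           # 10 tokens of width d
    w0 = rng.standard_normal((d, d))           # frozen pre-trained weight
    a = 0.01 * rng.standard_normal((d, r))     # trainable LoRA factor
    b = np.zeros((r, d))                       # zero init: LoRA starts as the frozen layer
    print(np.allclose(lora_linear(x, w0, a, b, alpha=8.0), x @ w0))

    w_down = 0.01 * rng.standard_normal((d, m))
    h = adapter(x, w_down, np.zeros(m), np.zeros((m, d)), np.zeros(d))
    print(np.allclose(h, x))                   # zero up-projection: adapter is an identity

    print(prepend_prompts(x, np.zeros((4, d))).shape)
    print(np.allclose(ssf(x, np.ones(d), np.zeros(d)), x))
```

In this sketch, zero-initializing the LoRA factor `b` and the adapter up-projection makes each module start as an identity modification of the frozen model. Because the LoRA update is linear, it can be folded into `w0` after training (`w0 + (alpha / r) * a @ b`), whereas adapters and prompts remain extra computation at inference time.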