PiCoGen2: Piano cover generation with transfer learning approach and weakly aligned data

Chih-Pin Tan, Hsin Ai, Yi-Hsin Chang, Shuen-Huei Guan, Yi-Hsuan Yang

TL;DR

PiCoGen2 addresses piano cover generation by combining a two-stage transfer-learning framework with weak beat-level alignment to avoid rhythmic distortion inherent in note remapping. A frozen lead-sheet encoder (SheetSage) provides high-level musical features during pre-training on piano-only data, which are then transferred to the song domain through fine-tuning on weakly-aligned song-to-piano pairs. The approach uses a decoder-only Transformer conditioned on a prior-encoded representation of the input song, with REMI-like tokenization to model piano performance. Empirical results across five pop genres show improvement in subjective quality and competitive objective metrics, outperforming baselines and validating the effectiveness of weak alignment and transfer learning for piano-cover generation.
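To make "REMI-like tokenization" concrete, the following is a minimal sketch of how a list of piano notes could be flattened into bar, position, pitch, velocity and duration tokens. It is an illustration only: the `Note` class, the `remi_like_tokens` function, the 4/4 meter, the grid of four positions per beat, and the token names and ordering are all assumptions made here, not the vocabulary used by PiCoGen2.

```python
from dataclasses import dataclass


@dataclass
class Note:
    # Hypothetical note record; field names are assumptions for this sketch.
    onset_beats: float     # onset time in beats from the start of the piece
    duration_beats: float  # note length in beats
    pitch: int             # MIDI pitch number (0-127)
    velocity: int          # MIDI velocity (1-127)


def remi_like_tokens(notes, beats_per_bar=4, positions_per_beat=4):
    """Flatten notes into a REMI-style token list.

    Assumes a fixed meter and a uniform grid; the real tokenizer may differ.
    """
    steps_per_bar = beats_per_bar * positions_per_beat
    tokens = []
    current_bar = -1
    for note in sorted(notes, key=lambda n: (n.onset_beats, n.pitch)):
        # Quantize the onset to the nearest grid step, then split into bar/position.
        step = round(note.onset_beats * positions_per_beat)
        bar, position = divmod(step, steps_per_bar)
        # Emit one Bar token for every bar boundary crossed, including empty bars.
        while current_bar < bar:
            tokens.append("Bar")
            current_bar += 1
        # Durations are quantized to the same grid, with a minimum of one step.
        duration_steps = max(1, round(note.duration_beats * positions_per_beat))
        tokens += [
            f"Position_{position}",
            f"Pitch_{note.pitch}",
            f"Velocity_{note.velocity}",
            f"Duration_{duration_steps}",
        ]
    return tokens
```

A decoder-only Transformer would then be trained to predict such a token sequence one token at a time, which is why metrical tokens like `Bar` and `Position_*` matter: they give the model an explicit beat grid to attend to.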

Abstract

Piano cover generation aims to create a piano cover from a pop song. Existing approaches mainly employ supervised learning and the training demands strongly-aligned and paired song-to-piano data, which is built by remapping piano notes to song audio. This would, however, result in the loss of piano information and accordingly cause inconsistencies between the original and remapped piano versions. To overcome this limitation, we propose a transfer learning approach that pre-trains our model on piano-only data and fine-tunes it on weakly-aligned paired data constructed without note remapping. During pre-training, to guide the model to learn piano composition concepts instead of merely transcribing audio, we use an existing lead sheet transcription model as the encoder to extract high-level features from the piano recordings. The pre-trained model is then fine-tuned on the paired song-piano data to transfer the learned composition knowledge to the pop song domain. Our evaluation shows that this training strategy enables our model, named PiCoGen2, to attain high-quality results, outperforming baselines on both objective and subjective metrics across five pop genres.

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The proposed model is trained in two stages: it is first pre-trained on piano-only data and then fine-tuned on weakly-aligned song-to-piano pairs.
  • Figure 2: A diagram of the proposed model, PiCoGen2. The fire and snowflake symbols indicate the trainable and frozen parts, respectively. For example, the parameters of SheetSage (Donahue et al., 2022), a model pre-trained for lead sheet transcription, are always frozen.
  • Figure 3: Mean opinion scores (MOS) for overall quality (OVL) from the user study, broken down by genre.
  • Figure 4: The pianoroll representation of a snippet from an example generated by the models. We observe that Ablation 2, which was trained on piano-only data, tends to generate repeated short notes.
  • Figure :
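
The per-genre MOS reported in Figure 3 is, in essence, an average of listener ratings grouped by genre and model. A minimal sketch of that arithmetic follows, assuming ratings arrive as `(genre, model, score)` tuples; the function name `mos_by_genre`, the tuple layout, and the placeholder labels in the usage example are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean


def mos_by_genre(ratings):
    """Average listener scores for each (genre, model) pair.

    `ratings` is an iterable of (genre, model, score) tuples; this layout
    is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for genre, model, score in ratings:
        grouped[(genre, model)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


# Made-up placeholder ratings, not results from the user study.
example = [
    ("genre_a", "model_x", 4),
    ("genre_a", "model_x", 5),
    ("genre_b", "model_x", 3),
]
scores = mos_by_genre(example)
```

Grouping by both genre and model keeps the comparison in the figure fair: each model's score in a genre is averaged only over ratings of songs from that genre.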