BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference

Sungjae Kim, Kihyun Na, Jinyoung Choi, Injung Kim

TL;DR

BERT-APC is a reference-free automatic pitch correction framework that bridges continuous vocal pitch and symbolic musical context by repurposing MusicBERT as a context prior. The system combines a note segmentator, a stationary pitch predictor that estimates perceptual note centers, and a context-aware pitch predictor to produce musically coherent target pitches; a learnable detuner generates realistic detuned training data. Empirically, it achieves higher note-pitch accuracy (raw pitch accuracy, RPA) and better stationary pitch alignment (PTR/MAE) than singing voice transcription (SVT) baselines, and a significantly higher mean opinion score (MOS) than AutoTune and Melodyne while preserving expressive nuances. The work shows that incorporating symbolic musical context into APC yields robust corrections that handle large detunings and improve perceptual quality. To the authors' knowledge, it is the first reference-free approach to pitch correction that uses symbolic prior knowledge.

Abstract

Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with the intended musical notes. However, existing APC systems either rely on reference pitches, which limits their practical applicability, or employ simple pitch estimation algorithms that often fail to preserve expressiveness and naturalness. We propose BERT-APC, a novel reference-free APC framework that corrects pitch errors while maintaining the natural expressiveness of vocal performances. In BERT-APC, a novel stationary pitch predictor first estimates the perceived pitch of each note from the detuned singing voice. A context-aware note pitch predictor then estimates the intended pitch sequence by leveraging a music language model repurposed to incorporate musical context. Finally, a note-level correction algorithm fixes pitch errors while preserving intentional pitch deviations for emotional expression. In addition, we introduce a learnable data augmentation strategy that improves the robustness of the music language model by simulating realistic detuning patterns. Compared to two recent singing voice transcription models, BERT-APC demonstrated superior performance in note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49%p on highly detuned samples in terms of raw pitch accuracy. In the MOS test, BERT-APC achieved the highest score of $4.32 \pm 0.15$, which is significantly higher than those of the widely used commercial APC tools, AutoTune ($3.22 \pm 0.18$) and Melodyne ($3.08 \pm 0.18$), while maintaining a comparable ability to preserve expressive nuances. To the best of our knowledge, this is the first APC model that leverages a music language model to achieve reference-free pitch correction with symbolic musical context. The corrected audio samples of BERT-APC are available online.
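
To make the idea of note-level correction concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that correction amounts to shifting every voiced frame of a note by one constant offset, the difference between the predicted target pitch and the estimated stationary pitch, so that within-note deviations such as vibrato are carried over unchanged. The function names (`hz_to_midi`, `midi_to_hz`, `correct_note`) and the convention that unvoiced frames are stored as 0.0 Hz are hypothetical choices made for illustration.

```python
import math


def hz_to_midi(f_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)


def midi_to_hz(midi):
    """Convert a (fractional) MIDI note number to a frequency in Hz."""
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)


def correct_note(f0_hz, stationary_midi, target_midi):
    """Shift one note's F0 contour by a constant number of semitones.

    Assumption (not from the paper): the offset is target minus stationary
    pitch, and frames with f0 <= 0 are unvoiced and left untouched.
    """
    shift = target_midi - stationary_midi
    return [
        midi_to_hz(hz_to_midi(f) + shift) if f > 0 else 0.0
        for f in f0_hz
    ]


# A note sung 0.6 semitones flat of A4 (MIDI 69), with a small vibrato
# of +/- 0.2 semitones and one unvoiced frame at the end.
sung = [midi_to_hz(68.4 + d) for d in (-0.2, 0.0, 0.2)] + [0.0]
corrected = correct_note(sung, stationary_midi=68.4, target_midi=69.0)
corrected_midi = [hz_to_midi(f) for f in corrected if f > 0]
```

Because the shift is applied in the semitone domain, the vibrato extent of the corrected contour equals that of the input, while the note center moves onto the target pitch. A real system would also need to handle note transitions and resynthesis, which this sketch ignores.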

Paper Structure

This paper contains 26 sections, 12 equations, 15 figures, 5 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Model architecture of BERT-APC. The system operates in three stages—note-level feature extraction, context-aware note pitch estimation, and note-level pitch correction. A concise step-by-step overview is provided in the blue box on the right.
  • Figure 3: The architecture of the context-aware note pitch predictor that is based on the symbolic music language model, MusicBERT.
  • Figure 5: Visualization of pitch correction results for a highly detuned sample. The green, blue, and orange lines represent the correction results, the input pitch, and the GT note pitch, respectively. (a) AutoTune and (b) Melodyne failed to correct pitch deviations exceeding one semitone, especially when the deviation spanned the full pitch range of a note. (c) In contrast, BERT-APC successfully corrected them by leveraging musical context via the musical language model, MusicBERT.
  • Figure : (a) Stationary Pitch Predictor
  • Figure : (a) In-tune subset
  • ...and 10 more figures