ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition

Muhammad Waseem Akram, Stefano Dettori, Valentina Colla, Giorgio Carlo Buttazzo

TL;DR

ChordFormer addresses large-vocabulary automatic chord recognition by combining Conformer blocks (convolutional and self-attention components) with a structured chord label representation and a reweighted loss that mitigates long-tail class imbalance. A CRF-based decoding strategy enforces temporal coherence while controlling chord vocabulary. Empirical results on the Humphrey–Bello dataset show improvements over state-of-the-art baselines in both frame-wise and class-wise accuracy, particularly for rare chords, and demonstrate the practical value of integrating music-theoretic structure into chord prediction. Potential extensions include adaptive reweighting and self-supervised learning.

Abstract

Chord recognition is a critical task in music information retrieval due to the abstract and descriptive nature of chords in music analysis. While audio chord recognition systems have achieved significant accuracy for small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains a challenging problem. The difficulty stems partly from the long-tail distribution of chords: rare chord types are underrepresented in most datasets, leaving too few training samples. Effective chord recognition requires leveraging contextual information from audio sequences, yet existing models, such as combinations of convolutional neural networks, bidirectional long short-term memory networks, and bidirectional transformers, face limitations in capturing long-term dependencies and exhibit suboptimal performance on large-vocabulary chord recognition tasks. This work proposes ChordFormer, a novel conformer-based architecture designed to tackle structural chord recognition (e.g., triads, bass, sevenths) for large vocabularies. ChordFormer leverages conformer blocks that integrate convolutional neural networks with transformers, enabling the model to capture both local patterns and global dependencies. By addressing class imbalance through a reweighted loss function and structured chord representations, ChordFormer outperforms state-of-the-art models, achieving a 2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary chord datasets. It also provides robust and balanced recognition across chord types. This approach bridges the gap between theoretical music knowledge and practical applications, advancing the field of large-vocabulary chord recognition.

Paper Structure

This paper contains 20 sections, 15 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the ChordFormer Architecture. The ChordFormer architecture consists of three primary components: (i) Preprocessing Module: Converts audio signals into Constant-Q Transform (CQT) representations to extract low-level time-frequency features essential for chord recognition. (ii) Conformer Block: Processes the CQT features to generate frame-wise activations using a combination of convolutional layers and self-attention mechanisms. This block leverages contextual frames to enhance accuracy in chord prediction. (iii) Decoding Model: Interprets the activations from the Conformer Block and generates the final chord sequence by decoding the structural chord components.
  • Figure 2: Architecture Overview of Key Modules in ChordFormer Model. (i) Feedforward Module: Consists of a pre-normalization layer followed by a linear layer, Swish activation, and a dropout mechanism. Another linear layer projects back to the model dimensions. A residual connection enhances feature propagation. (ii) Multi-Head Self-Attention Module: Incorporates multi-head self-attention with relative positional embeddings to capture global dependencies. Pre-normalization ensures stable gradient flow, and a residual connection is included to preserve input information. (iii) Convolution Module: Features a pointwise convolution with an expansion factor of 2, coupled with a Gated Linear Unit (GLU) activation. This is followed by a 1-D depthwise convolution, batch normalization, and a Swish activation layer. Dropout is applied to ensure regularization, and a residual connection completes the module. (iv) Feedforward Module: Reiterates the feedforward structure with pre-normalization, Swish activation, and dropout for robust learning and dimensionality preservation.
  • Figure 3: Top: chord class appearance in the song dataset; Bottom: ChordFormer recall values for each chord class under varying re-weight factors ($\gamma, w_{\text{max}}$).
  • Figure 4: Confusion matrix of ChordFormer with re-weighting factors ($\gamma = 0.5, w_{\text{max}} = 10$).
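
The captions for Figures 3 and 4 refer to two re-weighting factors, $\gamma$ and $w_{\text{max}}$, without stating how they enter the loss. The sketch below shows one common way such factors are used, and it is an assumption rather than the paper's confirmed formula: each class weight is the inverse relative frequency raised to the power $\gamma$, clamped at $w_{\text{max}}$, i.e. $w_c = \min\left((n_c / n_{\text{max}})^{-\gamma},\ w_{\text{max}}\right)$. The function name `class_weights` and the frame counts are hypothetical and chosen only for illustration.

```python
def class_weights(counts, gamma, w_max):
    """Clamped inverse-frequency weights (assumed form, not taken from the paper):

        w_c = min((n_c / n_max) ** -gamma, w_max)

    counts : dict mapping class label -> number of training frames (all > 0)
    gamma  : exponent controlling how strongly rare classes are boosted
    w_max  : upper bound that keeps very rare classes from dominating the loss
    """
    if any(n <= 0 for n in counts.values()):
        raise ValueError("every class needs a positive count")
    n_max = max(counts.values())
    return {c: min((n / n_max) ** (-gamma), w_max) for c, n in counts.items()}


# Hypothetical frame counts per chord quality (illustrative, not dataset values).
counts = {"maj": 10000, "min": 2500, "sus4": 100, "dim7": 4}

# Same factor values as in the Figure 4 caption.
weights = class_weights(counts, gamma=0.5, w_max=10.0)

for label, w in weights.items():
    print(f"{label:>5}: {w:.2f}")
```

Under this assumed form, $\gamma = 0$ gives every class a weight of 1 (no re-weighting), while larger $\gamma$ boosts rare classes more aggressively until the $w_{\text{max}}$ cap takes over. In the example, `dim7` would receive an unclamped weight of about 50, which the cap reduces to 10.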