ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition

Muhammad Waseem Akram; Stefano Dettori; Valentina Colla; Giorgio Carlo Buttazzo

TL;DR

ChordFormer addresses large-vocabulary automatic chord recognition by combining Conformer blocks (which pair convolutional and self-attention components) with a structured chord label representation and a reweighted loss that mitigates long-tail class imbalance. A CRF-based decoding strategy enforces temporal coherence while controlling the chord vocabulary. Empirical results on the Humphrey–Bello dataset show improvements over state-of-the-art baselines in both frame-wise and class-wise accuracy, particularly for rare chords, and demonstrate the practical value of integrating music-theoretic structure into chord prediction. The approach advances robust, interpretable large-vocabulary chord recognition, with potential extensions to adaptive reweighting and self-supervised learning.
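To illustrate how a decoding step can enforce temporal coherence, the following is a minimal sketch of Viterbi decoding over frame-wise label scores with a fixed penalty for every label change. It is not the paper's CRF: the function name `viterbi_smooth`, the single `switch_penalty` parameter, and the uniform transition model are assumptions chosen for illustration only.

```python
def viterbi_smooth(frame_scores, switch_penalty=1.0):
    """Return the label path that maximizes the sum of frame scores
    minus switch_penalty for every change of label between frames.

    frame_scores: list of per-frame lists of label scores (e.g. log-probs).
    switch_penalty: assumed non-negative; hypothetical smoothing knob.
    """
    if not frame_scores:
        return []
    n_labels = len(frame_scores[0])
    best = list(frame_scores[0])  # best path score ending in each label
    backptr = []                  # one list of predecessors per later frame
    for scores in frame_scores[1:]:
        # The best predecessor when switching is the overall best label.
        top_prev = max(range(n_labels), key=lambda k: best[k])
        new_best, pointers = [], []
        for j in range(n_labels):
            stay = best[j]
            switch = best[top_prev] - switch_penalty
            if stay >= switch:
                prev, value = j, stay
            else:
                prev, value = top_prev, switch
            new_best.append(value + scores[j])
            pointers.append(prev)
        best = new_best
        backptr.append(pointers)
    # Trace the best path backwards from the best final label.
    label = max(range(n_labels), key=lambda k: best[k])
    path = [label]
    for pointers in reversed(backptr):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path


# A weak one-frame preference for label 1 is smoothed away by the penalty.
scores = [[2.0, 0.0], [0.0, 0.5], [2.0, 0.0]]
smoothed = viterbi_smooth(scores, switch_penalty=1.0)
unsmoothed = viterbi_smooth(scores, switch_penalty=0.0)
```

With a penalty of zero the decoder reduces to a per-frame argmax; raising the penalty trades responsiveness to short chord changes for steadier output.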

Abstract

Chord recognition is a critical task in music information retrieval because chords provide an abstract, descriptive representation for music analysis. While audio chord recognition systems have achieved significant accuracy for small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains a challenging problem. The difficulty stems in part from the inherent long-tail distribution of chords: rare chord types are underrepresented in most datasets, leaving insufficient training samples. Effective chord recognition requires leveraging contextual information from audio sequences, yet existing models, such as combinations of convolutional neural networks, bidirectional long short-term memory networks, and bidirectional transformers, face limitations in capturing long-term dependencies and exhibit suboptimal performance on large-vocabulary chord recognition tasks. This work proposes ChordFormer, a novel Conformer-based architecture designed to tackle structural chord recognition (e.g., triads, bass, sevenths) for large vocabularies. ChordFormer leverages Conformer blocks that integrate convolutional neural networks with transformers, enabling the model to capture both local patterns and global dependencies effectively. By addressing class imbalance through a reweighted loss function and structured chord representations, ChordFormer outperforms state-of-the-art models, achieving a 2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary chord datasets. Furthermore, ChordFormer handles class imbalance well, providing robust and balanced recognition across chord types. This approach bridges the gap between theoretical music knowledge and practical applications, advancing the field of large-vocabulary chord recognition.
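As a rough illustration of how a reweighted loss can counter a long-tail class distribution, the sketch below weights each class by an inverse power of its relative frequency, capped at a maximum. The abstract does not give the loss formula, so the weighting rule, the function names, and the parameters `gamma` and `w_max` are all assumptions here, not the authors' definition.

```python
import math


def class_weights(counts, gamma=0.5, w_max=10.0):
    """Weight each class by (count / max_count) ** -gamma, capped at w_max.

    gamma and w_max are hypothetical hyperparameters: gamma = 0 gives
    uniform weights, larger gamma boosts rare classes more strongly.
    """
    max_count = max(counts)
    weights = []
    for c in counts:
        if c == 0:
            # Unseen classes get the cap rather than an infinite weight.
            weights.append(w_max)
        else:
            weights.append(min((c / max_count) ** -gamma, w_max))
    return weights


def weighted_cross_entropy(probs, target, weights):
    """Negative log-likelihood of the target class, scaled by its weight."""
    return -weights[target] * math.log(probs[target])


# Three chord classes with a long-tail count distribution.
weights = class_weights([100, 25, 1], gamma=0.5, w_max=5.0)
loss_rare = weighted_cross_entropy([0.5, 0.5], 1, [1.0, 2.0])
```

The cap keeps extremely rare classes from dominating the gradient, which is one common reason to bound such weights.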
