LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging
Shubhr Singh, Emmanouil Benetos, Huy Phan, Dan Stowell
TL;DR
The paper addresses a limitation of Transformers in audio: self-attention models only pairwise interactions. It introduces Local-Higher Order Graph Neural Networks (LHGNN), which fuse local k-NN structure with higher-order clustering via Fuzzy C-Means. It presents a novel graph kernel and ConvFFN-based blocks that update node features through a max-relative fusion of local and centroid-derived information, followed by a downsampling pipeline. Empirically, LHGNN outperforms Transformer baselines on AudioSet, FSD50K, and ESC-50 with fewer parameters, and shows notable gains when ImageNet pretraining is unavailable. The work demonstrates a practical, data-efficient framework for audio classification and tagging, with potential for broader audio understanding tasks and multi-scale relational modeling.
Abstract
Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local-Higher Order Graph Neural Network (LHGNN), a graph-based model that enhances feature understanding by integrating local neighbourhood information with higher-order information from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.
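The idea of combining local neighbourhood information with Fuzzy C-Means cluster information can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`knn_indices`, `fuzzy_c_means`, `local_higher_order_update`), the membership-weighted centroid term, the concatenation step, and all hyperparameters are assumptions chosen for illustration, and the ConvFFN blocks and downsampling stages are omitted.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbours (Euclidean) of each row, excluding itself."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # a node is never its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def fuzzy_c_means(x, c, m=2.0, iters=20, seed=0):
    """Standard Fuzzy C-Means. Returns centroids (c, d) and memberships (n, c)."""
    rng = np.random.default_rng(seed)
    u = rng.random((x.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)  # each row of memberships sums to 1
    for _ in range(iters):
        w = u ** m
        centroids = (w.T @ x) / w.sum(axis=0)[:, None]
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1) + 1e-8
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centroids, u

def local_higher_order_update(x, k=4, c=3):
    """Hypothetical fusion step: concatenate node features with two max-relative terms."""
    idx = knn_indices(x, k)
    centroids, u = fuzzy_c_means(x, c)
    # Local term: elementwise max of (neighbour - node) over the k neighbours.
    local = (x[idx] - x[:, None, :]).max(axis=1)
    # Higher-order term (assumption): membership-weighted (centroid - node),
    # reduced by an elementwise max over the c clusters.
    higher = (u[:, :, None] * (centroids[None, :, :] - x[:, None, :])).max(axis=1)
    return np.concatenate([x, local, higher], axis=1)

# Toy input: 10 nodes (e.g. spectrogram patches) with 5-dimensional features.
nodes = np.random.default_rng(1).normal(size=(10, 5))
fused = local_higher_order_update(nodes, k=4, c=3)
```

In a full model, the concatenated features would typically be passed through a learned projection before the next layer. Here the fused array simply has three times the input feature width.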
