LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

Shubhr Singh, Emmanouil Benetos, Huy Phan, Dan Stowell

TL;DR

The paper addresses a limitation of Transformers in audio, namely that self-attention models only pairwise interactions, by introducing Local-Higher Order Graph Neural Networks (LHGNN), which fuse local k-NN structure with higher-order clustering via Fuzzy C-Means. It presents a novel graph kernel and ConvFFN-based blocks that update node features through a max-relative fusion of local and centroid-derived information, followed by a downsampling pipeline. Empirically, LHGNN outperforms Transformer baselines on AudioSet, FSD50K, and ESC-50 with fewer parameters, and shows notable gains when pretraining data is scarce. The work demonstrates a practical, data-efficient framework for audio classification and tagging, with potential for broader audio understanding tasks and multi-scale relational modeling.

Abstract

Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local-Higher Order Graph Neural Network (LHGNN), a graph-based model that enhances feature understanding by integrating local neighbourhood information with higher-order data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.

Paper Structure (22 sections, 5 equations, 1 figure, 5 tables)


Figures (1)

  • Figure 1: Architecture of LHGNN. The input mel-spectrogram is processed by a convolution block and passed to the LHG blocks. In each LHG block, a single node is updated by first constructing a k-NN graph and simultaneously running Fuzzy C-Means. The local (k-NN graph) and higher-order (cluster centers from Fuzzy C-Means) information is fused to update the node, followed by a graph convolution; the result is then sent to a ConvFFN block. DWConv in the ConvFFN block refers to Depthwise Convolution. $L$ represents the number of repetitions of the LHG blocks.
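
The node update described in the Figure 1 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`knn_indices`, `fuzzy_c_means`, `lhg_update`), the fuzzifier `m = 2`, the iteration count, and the choice to take the element-wise maximum over differences to all k neighbours and all cluster centres are hypothetical choices made for the example. The graph convolution and ConvFFN stages that follow are omitted.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbours of every node (self excluded)."""
    d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    np.fill_diagonal(d, np.inf)                          # a node is not its own neighbour
    return np.argsort(d, axis=1)[:, :k]                  # (N, k)

def fuzzy_c_means(x, c, m=2.0, iters=20, seed=0):
    """Textbook Fuzzy C-Means; returns centres (c, D) and memberships (N, c)."""
    rng = np.random.default_rng(seed)
    u = rng.random((x.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)                    # each row sums to 1
    for _ in range(iters):
        w = u ** m
        centres = (w.T @ x) / w.sum(axis=0)[:, None]     # membership-weighted means
        dist = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=-1) + 1e-9
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)         # standard membership update
    return centres, u

def lhg_update(x, k=4, c=3):
    """Assumed fusion: max-relative over k-NN neighbours and cluster centres."""
    nbrs = knn_indices(x, k)                             # (N, k)
    centres, _ = fuzzy_c_means(x, c)                     # memberships unused in this sketch
    local = x[nbrs] - x[:, None, :]                      # (N, k, D) differences to neighbours
    higher = centres[None, :, :] - x[:, None, :]         # (N, c, D) differences to centres
    rel = np.concatenate([local, higher], axis=1).max(axis=1)  # (N, D) element-wise max
    return np.concatenate([x, rel], axis=-1)             # (N, 2D), input to a graph convolution
```

For example, `lhg_update(np.random.default_rng(1).normal(size=(10, 5)))` returns an array of shape `(10, 10)` whose first five columns are the original node features and whose last five hold the fused relative features.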