Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation

Zirui Ge; Xinzhou Xu; Haiyan Guo; Tingting Wang; Zhen Yang

TL;DR

This paper tackles pooling in speaker recognition (SR) on self-supervised representations by introducing IsoGAT, a three-module architecture that combines representation learning with wav2vec 2.0, a graph attention module based on cosine similarity, and an injective aggregation scheme inspired by GIN. The approach targets two weaknesses of existing methods, fixed pooling and non-injective aggregation, and yields speaker-discriminative embeddings with lower equal error rates (EERs) on VoxCeleb1&2 than multiple baselines. Key contributions include a cosine-based GAT for weighting low-level components, an injective aggregation mechanism, and a two-stage training protocol that leverages both last-layer and all-layer representations. The results indicate that IsoGAT provides a robust pooling strategy for SR with self-supervised representations, with potential for temporal fusion and cross-task knowledge transfer.

Abstract

The emergence of self-supervised representation (i.e., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion of the representation requires further investigation, because existing approaches rely on fixed or sub-optimal temporal pooling strategies. Although improved strategies take graph learning and graph attention into account, they still use non-injective aggregation, which may degrade speaker-recognition performance. In this regard, we propose a speaker-recognition approach using an Isomorphic Graph ATtention network (IsoGAT) on self-supervised representation. The proposed approach contains three modules (representation learning, graph attention, and aggregation) and jointly considers learning on the self-supervised representation and the IsoGAT. We then perform speaker-recognition experiments on the VoxCeleb1&2 datasets, and the results demonstrate the recognition performance of the proposed approach compared with existing pooling approaches on the self-supervised representation.
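To make the pooling idea concrete, the following is a minimal NumPy sketch of how frame-level features could be turned into a graph and pooled into one utterance-level embedding. It is not the authors' implementation: the softmax over cosine similarities, the two-layer ReLU MLP, the `epsilon` self-weight, the sum readout, and all function names (`cosine_adjacency`, `gin_style_layer`, `pool_utterance`) are assumptions chosen for illustration, and random arrays stand in for wav2vec 2.0 outputs.

```python
import numpy as np


def cosine_adjacency(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Attention weights from pairwise cosine similarity (assumed form).

    frames: (T, D) frame-level features, one vertex per frame.
    Returns a (T, T) matrix whose rows sum to 1.
    """
    unit = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T                           # cosine similarities in [-1, 1]
    sim = sim - sim.max(axis=1, keepdims=True)    # stabilise the softmax
    weights = np.exp(sim)
    return weights / weights.sum(axis=1, keepdims=True)


def gin_style_layer(states, adjacency, w1, w2, epsilon=0.0):
    """One GIN-like update: weighted neighbour sum plus own state, then an MLP."""
    mixed = (1.0 + epsilon) * states + adjacency @ states
    hidden = np.maximum(mixed @ w1, 0.0)          # ReLU
    return hidden @ w2


def pool_utterance(frames, layers, epsilon=0.0):
    """Stack K layers, then sum over vertices (assumed readout)."""
    adjacency = cosine_adjacency(frames)
    states = frames
    for w1, w2 in layers:
        states = gin_style_layer(states, adjacency, w1, w2, epsilon)
    return states.sum(axis=0)


# Hypothetical usage: 50 frames of 16-dim features, K = 2 layers.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 16))
layers = [
    (0.1 * rng.standard_normal((16, 32)), 0.1 * rng.standard_normal((32, 16)))
    for _ in range(2)
]
embedding = pool_utterance(frames, layers)
print(embedding.shape)  # (16,)
```

A sum is used rather than a mean or max because, in the GIN line of work, sum aggregation followed by an MLP can be injective over multisets of vertex states, whereas mean and max can map different multisets to the same output. How closely this matches the paper's aggregation module should be checked against its methodology section.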

Paper Structure (15 sections, 11 equations, 4 figures, 6 tables)

Introduction
Related Works
Self-Supervised Speech Representation Learning
Graph Signals and Graph Neural Network
Methodology
Representation Learning Module
Graph Attention Module
Aggregation Module
Experimental Setups
The Datasets
Implementation Details
Experimental Results
Experimental Comparisons
Ablation Study
Conclusion

Figures (4)

Figure 1: A diagrammatic overview of the proposed IsoGAT approach, including three modules of representation learning, graph attention, and aggregation, based on the self-supervised representation.
Figure 2: A diagrammatic overview of the pre-training and fine-tuning procedures in the representation learning module, with the left part indicating the pre-training phase and the right part corresponding to the fine-tuning phase.
Figure 3: An overview of the aggregation module, which includes $K$ layers, each consisting of a weighted sum over the vertices' states followed by an MLP.
Figure 4: The visualized adjacency matrices for six utterances, each from a different speaker, where brighter pixels represent larger adjacency weights.
