Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation

Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang

TL;DR

This paper tackles pooling in speaker recognition on self-supervised representations by introducing IsoGAT, a three-module architecture that combines representation learning from wav2vec 2.0, a graph attention network (GAT) module based on cosine similarity, and an injective aggregation scheme inspired by the graph isomorphism network (GIN). The approach targets two weaknesses of earlier methods, fixed pooling and non-injective aggregation, and yields speaker-discriminative embeddings with lower equal error rates (EERs) on VoxCeleb1&2 than multiple baselines. Key contributions include a cosine-based GAT for weighting low-level components, an injective aggregation mechanism, and a two-stage training protocol that uses both last-layer and all-layer representations. The results indicate that IsoGAT is an effective pooling strategy for speaker recognition with self-supervised representations, with potential for temporal fusion and cross-task knowledge transfer.

Abstract

The emergence of self-supervised representations (i.e., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion of these representations requires further investigation, because existing approaches rely on fixed or sub-optimal temporal pooling strategies. Despite improved strategies that incorporate graph learning and graph attention, non-injective aggregation persists in these approaches, which may degrade speaker-recognition performance. In this regard, we propose a speaker recognition approach using an Isomorphic Graph ATtention network (IsoGAT) on self-supervised representations. The proposed approach contains three modules (representation learning, graph attention, and aggregation) and jointly considers learning on the self-supervised representation and the IsoGAT. We then perform speaker-recognition experiments on the VoxCeleb1&2 datasets; the results demonstrate the recognition performance of the proposed approach compared with existing pooling approaches on the self-supervised representation.
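
As a rough illustration of the pooling idea summarised above, the following minimal Python sketch treats each frame-level vector as a graph vertex, weights vertex pairs by softmax-normalised cosine similarity, applies GIN-style layers (a weighted sum followed by a per-layer transformation), and sums the final vertex states into one utterance-level vector. It is not the authors' implementation: the function names, the ReLU stand-in for the MLP, the `eps` parameter, the softmax normalisation, and the sum readout are all assumptions made for illustration, and no learned weights are involved.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_attention(frames):
    """Row i holds the softmax-normalised cosine scores of frame i against every frame."""
    return [softmax([cosine_similarity(f_i, f_j) for f_j in frames]) for f_i in frames]


def relu(vector):
    """Hypothetical stand-in for the per-layer MLP; a real model would use learned weights."""
    return [max(0.0, x) for x in vector]


def gin_style_layer(states, attention, eps=0.0, mlp=relu):
    """h_i <- mlp((1 + eps) * h_i + sum_j attention[i][j] * h_j), summing over all j."""
    updated = []
    for i, h_i in enumerate(states):
        mixed = [(1.0 + eps) * x for x in h_i]
        for weight, h_j in zip(attention[i], states):
            mixed = [m + weight * x for m, x in zip(mixed, h_j)]
        updated.append(mlp(mixed))
    return updated


def sum_readout(states):
    """Sum over vertices: distinguishes multisets that differ only in element counts."""
    return [sum(column) for column in zip(*states)]


def mean_readout(states):
    """Mean over vertices: shown only for contrast with the sum readout."""
    return [sum(column) / len(states) for column in zip(*states)]


def pool_utterance(frames, num_layers=2):
    """Map a list of frame vectors to a single utterance-level vector."""
    attention = cosine_attention(frames)
    states = frames
    for _ in range(num_layers):
        states = gin_style_layer(states, attention)
    return sum_readout(states)


# Toy input: three 2-dimensional "frames" (made-up values, not real features).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding = pool_utterance(frames)
```

The two readout helpers hint at why the choice of aggregator matters: `mean_readout` returns the same vector for one frame and for two identical copies of that frame, whereas `sum_readout` does not, which is the intuition behind injective aggregation in GIN.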

Paper Structure (15 sections, 11 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: A diagrammatic overview of the proposed IsoGAT approach, including three modules of representation learning, graph attention, and aggregation, based on the self-supervised representation.
  • Figure 2: A diagrammatic overview of the pre-training and fine-tuning procedures in the representation learning module, with the left part indicating the pre-training phase and the right part corresponding to the fine-tuning phase.
  • Figure 3: An overview of the aggregation module, comprising $K$ layers, each consisting of a weighted sum over the vertices' states followed by an MLP.
  • Figure 4: Visualized adjacency matrices for six utterances, each from a different speaker; brighter pixels represent larger adjacency weights.