Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation
Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang
TL;DR
This paper tackles pooling in speaker recognition using self-supervised representations by introducing IsoGAT, a three-part architecture that fuses representation learning from wav2vec 2.0, a GAT-based graph attention module with cosine similarity, and an injective aggregation scheme inspired by GIN. The approach directly addresses fixed pooling and non-injective aggregation, yielding speaker-discriminative embeddings and improved EERs on VoxCeleb1&2 compared to multiple baselines. Key contributions include a cosine-based GAT for low-level component weighting, an injective aggregation mechanism, and a two-stage training protocol that leverages both last-layer and all-layer representations. The results indicate that IsoGAT provides a robust pooling strategy for SR with self-supervised representations, with potential for temporal fusion and cross-task knowledge transfer.
Abstract
The emergence of self-supervised representations (e.g., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion of these representations requires further investigation, because existing approaches rely on fixed or sub-optimal temporal pooling strategies. Although improved strategies incorporate graph learning and graph attention, their aggregation remains non-injective, which may degrade speaker-recognition performance. In this regard, we propose a speaker-recognition approach using an Isomorphic Graph ATtention network (IsoGAT) on self-supervised representations. The proposed approach contains three modules (representation learning, graph attention, and aggregation) and jointly considers learning on the self-supervised representation and the IsoGAT. We then perform speaker-recognition experiments on the VoxCeleb1&2 datasets; the results demonstrate the recognition performance of the proposed approach compared with existing pooling approaches on the self-supervised representation.
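To make the notion of non-injective aggregation concrete, the following minimal sketch (plain Python, not the paper's implementation) shows that mean pooling maps two different multisets of frame features to the same vector, whereas sum pooling keeps them apart. It also includes a GIN-style node update with cosine-similarity edge weights. All function names, the `eps` parameter, and the unnormalized cosine weighting are assumptions chosen for illustration; the actual IsoGAT formulation may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)  # small constant avoids division by zero

def mean_pool(frames):
    """Average over frames: loses the number of frames (non-injective on multisets)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def sum_pool(frames):
    """Sum over frames: distinguishes multisets that mean pooling collapses."""
    return [sum(col) for col in zip(*frames)]

def cosine_weighted_sum_update(frames, eps=0.0):
    """Hypothetical GIN-style update: (1 + eps) * h_i + sum_j cos(h_i, h_j) * h_j.

    The cosine edge weights are an assumption for illustration only.
    """
    out = []
    for i, h_i in enumerate(frames):
        agg = [0.0] * len(h_i)
        for j, h_j in enumerate(frames):
            if j == i:
                continue
            w = cosine(h_i, h_j)
            agg = [a + w * x for a, x in zip(agg, h_j)]
        out.append([(1.0 + eps) * x + a for x, a in zip(h_i, agg)])
    return out

# Two different multisets of frame features.
one_frame = [[1.0, 2.0]]
two_frames = [[1.0, 2.0], [1.0, 2.0]]

print(mean_pool(one_frame) == mean_pool(two_frames))  # mean cannot tell them apart
print(sum_pool(one_frame) == sum_pool(two_frames))    # sum can
```

In this toy case the mean-pooled vectors coincide while the sum-pooled vectors differ, which is the kind of information loss that an injective aggregation scheme is meant to avoid.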
