GMM-ResNet2: Ensemble of Group ResNet Networks for Synthetic Speech Detection

Zhenchun Lei; Hui Yan; Changhong Liu; Yong Zhou; Minglei Ma

TL;DR

This work tackles synthetic speech detection for anti-spoofing in automatic speaker verification by proposing GMM-ResNet2, an architecture that combines multi-order GMM-derived Log Gaussian Probability features, a grouping-and-ensemble strategy, an optimized residual block, and an ensemble-aware loss. The method extracts rich LGP features using GMMs with multiple component counts, organizes them into groups for parallel, efficient embeddings, and ensembles their predictions to improve robustness. Ablation studies show the ensemble-aware loss provides the most significant gain, with the improved residual block offering notable improvements on certain tasks; collectively, the approach achieves strong results on ASVspoof 2019 LA and competitive performance on 2021 LA and DF benchmarks. The model is efficient due to grouping, includes a public implementation, and has potential for deployment in real-world anti-spoofing systems and future extensions to speaker recognition and self-distillation techniques.
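The multi-order feature extraction summarized above can be illustrated with a minimal sketch. It assumes that each Log Gaussian Probability (LGP) feature is the log of one weighted diagonal-covariance Gaussian component evaluated on a single acoustic frame, and that vectors from GMMs of different orders are simply concatenated. The function names, the toy GMM parameters, and the absence of any normalization are illustrative assumptions, not details taken from the paper.

```python
import math


def log_gaussian_probabilities(frame, weights, means, variances):
    """Return log(w_i * N(frame; mu_i, diag(var_i))) for every component i.

    Assumption: diagonal covariances and no further normalization.
    """
    feats = []
    for w, mu, var in zip(weights, means, variances):
        log_density = 0.0
        for x_d, m_d, v_d in zip(frame, mu, var):
            log_density += -0.5 * (math.log(2.0 * math.pi * v_d)
                                   + (x_d - m_d) ** 2 / v_d)
        feats.append(math.log(w) + log_density)
    return feats


def multi_order_lgp(frame, gmms):
    """Concatenate LGP vectors from several GMMs with different orders."""
    out = []
    for weights, means, variances in gmms:
        out.extend(log_gaussian_probabilities(frame, weights, means, variances))
    return out


# Hypothetical toy GMMs with 2 and 4 components over 2-dimensional frames.
gmm_small = ([0.5, 0.5],
             [[0.0, 0.0], [1.0, 1.0]],
             [[1.0, 1.0], [1.0, 1.0]])
gmm_large = ([0.25, 0.25, 0.25, 0.25],
             [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.0], [0.0, 2.0]],
             [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])

features = multi_order_lgp([0.0, 0.0], [gmm_small, gmm_large])
print(len(features))  # one feature per Gaussian component across both GMMs
```

In this sketch the feature dimension grows with the total number of Gaussian components, which is one plausible reason for splitting the features into groups before classification.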

Abstract

Deep learning models are widely used for speaker recognition and spoofed speech detection. We propose GMM-ResNet2 for synthetic speech detection. Compared with the previous GMM-ResNet model, GMM-ResNet2 has four improvements. First, because GMMs of different orders differ in their ability to form smooth approximations to the feature distribution, multiple GMMs are used to extract multi-scale Log Gaussian Probability features. Second, a grouping technique is used to improve classification accuracy by exposing the group cardinality while reducing both the number of parameters and the training time; the final score is obtained by averaging the outputs of all group classifiers. Third, the residual block is improved by including one activation function and one batch normalization layer. Finally, an ensemble-aware loss function is proposed to integrate the independent loss functions of all ensemble members. On the ASVspoof 2019 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.0227 and an EER of 0.79%. On the ASVspoof 2021 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.2362 and an EER of 2.19%, relative reductions of 31.4% and 76.3%, respectively, compared with the LFCC-LCNN baseline.
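The score averaging and the ensemble-aware loss can be sketched as follows. The abstract does not give the exact form of the loss, so this example assumes one plausible combination: the cross-entropy of the averaged output plus the mean of the per-group cross-entropies. The function names and that specific combination are hypothetical and should not be read as the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one example given its integer class label."""
    return -math.log(softmax(logits)[label])


def ensemble_score(group_logits):
    """Average the outputs of all group classifiers, class by class."""
    n_groups = len(group_logits)
    n_classes = len(group_logits[0])
    return [sum(g[c] for g in group_logits) / n_groups
            for c in range(n_classes)]


def ensemble_aware_loss(group_logits, label):
    """Assumed form: loss of the averaged output plus the mean member loss."""
    member = sum(cross_entropy(g, label) for g in group_logits) / len(group_logits)
    joint = cross_entropy(ensemble_score(group_logits), label)
    return joint + member


# Two hypothetical group classifiers that disagree on a two-class example.
logits = [[2.0, 0.0], [0.0, 2.0]]
print(ensemble_score(logits))
print(ensemble_aware_loss(logits, label=0))
```

Under this assumed form, every group classifier receives its own training signal while the averaged prediction is optimized jointly, which matches the stated goal of integrating the members' independent losses.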
