Quantizer-Aware Hierarchical Neural Codec Modeling for Speech Deepfake Detection

Jinyang Wu; Zihan Pan; Qiquan Zhang; Sailor Hardik Bhupendra; Soumik Mondal

Abstract

Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. In particular, different quantization levels capture complementary acoustic cues: early quantizers encode coarse structure, while later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer-level hierarchy. We propose a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting, yielding structured codec representations aligned with forensic cues. With the speech encoder backbone frozen and only 4.4% additional parameters trained, our method achieves relative EER reductions of 46.2% on ASVspoof 2019 and 13.9% on ASVspoof5 over strong baselines.
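
To make the coarse-to-fine hierarchy and the idea of global quantizer weighting concrete, the following is a minimal toy sketch in pure Python. It is not the authors' implementation: the helper names (`make_codebooks`, `rvq_encode`, `weighted_sum`), the random toy codebooks, the softmax normalization, and the use of one scalar weight per quantizer level are all assumptions made for illustration (the section title "Dimension-wise Static Aggregation" suggests the actual method may weight each feature dimension separately).

```python
import math
import random


def make_codebooks(num_levels, size, dim, seed=0):
    """Build toy codebooks (hypothetical): later levels hold smaller vectors."""
    rng = random.Random(seed)
    books = []
    for q in range(num_levels):
        scale = 0.5 ** q
        # The zero vector is included so a level can never increase the residual.
        book = [[0.0] * dim]
        book += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        books.append(book)
    return books


def nearest(codebook, x):
    """Index of the codebook entry closest to x in squared L2 distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))


def rvq_encode(x, codebooks):
    """Residual VQ: each level quantizes what the previous levels left over.

    Returns the chosen vector per level and the residual norm before the
    first level and after every level.
    """
    residual = list(x)
    norms = [math.sqrt(sum(r * r for r in residual))]
    level_vectors = []
    for book in codebooks:
        chosen = book[nearest(book, residual)]
        level_vectors.append(chosen)
        residual = [r - c for r, c in zip(residual, chosen)]
        norms.append(math.sqrt(sum(r * r for r in residual)))
    return level_vectors, norms


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(level_vectors, logits):
    """Aggregate per-level vectors with one global weight per level.

    `logits` stand in for learnable parameters; softmax(logits) plays the
    role of the contribution weights alpha_q (an assumed parameterization).
    """
    alphas = softmax(logits)
    dim = len(level_vectors[0])
    return [sum(a * v[d] for a, v in zip(alphas, level_vectors))
            for d in range(dim)]


if __name__ == "__main__":
    books = make_codebooks(num_levels=4, size=8, dim=3, seed=1)
    vectors, norms = rvq_encode([0.3, -0.7, 0.5], books)
    print("residual norms per level:", norms)
    print("uniform weights (mean pooling):", weighted_sum(vectors, [0.0] * 4))
    print("weights favouring later levels:", weighted_sum(vectors, [0.0, 0.0, 1.0, 2.0]))
```

With all logits equal, the weighted sum reduces to mean pooling over quantizers, which corresponds to the kind of baseline mentioned in the Figure 1 caption; unequal logits let some levels contribute more than others.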

Paper Structure (16 sections, 9 equations, 3 figures, 2 tables)


Introduction
Related Work
SSL-based deepfake detection and parameter-efficient adaptation
Neural audio codecs: continuous latents and discrete RVQ hierarchy
Method
Hierarchy codec modeling
Quantizer-Aware Dimension-wise Static Aggregation (QAF-Static)
Lightweight SSL--Codec Fusion
Experimental Setup
Datasets and Evaluation Metric
Model Configuration and Training Details
Results and Discussion
Hierarchy-aware modeling of codec representations
Complementarity with SSL-based representations
Cross-codec robustness evaluation
...and 1 more section

Figures (3)

Figure 1: Overview of the proposed speech deepfake detection framework. SSL features from WavLM (with Attentive Merging) are fused with codec representations through quantizer-aware weighting over RVQ levels. A quantizer mean pooling baseline is included for comparison.
Figure 2: Detection performance across codec groups in the CodecFake benchmark. The proposed quantizer-aware static fusion improves over the ATTM-LSTM baseline and codec concatenation on codec family group B.
Figure 3: Learned quantizer contribution weights ($\alpha_q$) on the ASVspoof5 dataset. Both the SSL encoder and the codec encoder are frozen during training.
