Towards Interpretable Framework for Neural Audio Codecs via Sparse Autoencoders: A Case Study on Accent Information

Shih-Heng Wang; Tiantian Feng; Aditya Kommineni; Thanathai Lertpetchpun; Bowen Yi; Xuan Shi; Shrikanth Narayanan

Abstract

Neural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and deploying them in sensitive applications. Hence, we employ Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. In this work, we focus on a challenging paralinguistic attribute, accent, and propose a framework to quantify NAC interpretability. We evaluate four NAC models under 16 SAE configurations using a relative performance index. Our results show that DAC and SpeechTokenizer achieve the highest interpretability. We further reveal that acoustic-oriented NACs encode accent information primarily in the activation magnitudes of sparse representations, whereas phonetic-oriented NACs rely more on activation positions, and that low-bitrate EnCodec variants show higher interpretability.

Paper Structure (17 sections, 2 equations, 4 figures, 3 tables)

Introduction
Framework
Neural audio codec representations
Sparse autoencoder
Information encoding and logistic regression
Interpretability measurement
Experimental setup
Dataset
Neural audio codecs
Training & hyperparameters
Result & Analysis
NAC overall interpretability
Analysis: position vs. magnitude
Analysis: bitrate
Conclusion and future work
...and 2 more sections

Figures (4)

Figure 1: Sparse Autoencoder (SAE) framework for measuring the interpretability of NACs. (A) Extraction of the utterance-level representation for a given speech input using a NAC. (B) SAE training to learn the sparse representation $\mathbf{z}$. Position ($\mathbf{z}_\text{pos}$) and magnitude ($\mathbf{z}_\text{mag}$) are then derived from the sparse representation. (C) Logistic regression models are trained for binary accent classification tasks, with the utterance-level representation ($\mathbf{u}$) providing the reference score $\text{F1}_{ref}$ and the sparse representation providing $\text{F1}_{d_z,k}$.
Figure 2: Interpretability measured by $\Delta$F1 (%) under the US vs. UK setting. Each subplot corresponds to a different latent ratio $q$. The x-axis denotes sparsity level $s$, and higher $\Delta$F1 indicates better interpretability.
Figure 3: Interpretability measured by $\Delta$F1 (%) under the US vs. Non-US-UK setting. Each subplot corresponds to a different latent ratio $q$. The x-axis denotes sparsity level $s$, and higher $\Delta$F1 indicates better interpretability.
Figure 4: Activation density distribution across ordered latent dimensions on the test set. Dimensions in $\mathbf{z}$ are divided into 16 sequential groups; the x-axis shows group index and the y-axis shows activation density.
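
To make step (B) of Figure 1 concrete, the following is a minimal sketch of how position and magnitude views might be derived from a sparse code. The captions above do not define these quantities, so everything here is an assumption for illustration: a top-$k$ sparsification after a ReLU, $\mathbf{z}_\text{pos}$ as a binary indicator of which latent dimensions are active, and $\mathbf{z}_\text{mag}$ as the active values sorted in descending order (which discards where they occurred). The function names `top_k_sparsify` and `split_position_magnitude` are hypothetical and not taken from the paper.

```python
def top_k_sparsify(h, k):
    """Keep the k largest activations of h (after ReLU) and zero out the rest.

    Assumption: a top-k style SAE; the paper may use a different sparsity mechanism.
    """
    # Indices of the k largest pre-activations.
    keep = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    z = [0.0] * len(h)
    for i in keep:
        z[i] = max(h[i], 0.0)  # ReLU: negative values stay at zero
    return z


def split_position_magnitude(z, k):
    """Split a sparse code z into a position view and a magnitude view.

    Assumed definitions (not confirmed by the source):
      z_pos: 1.0 where z is active, 0.0 elsewhere (which dimensions fired)
      z_mag: active values sorted in descending order, zero-padded to length k
             (how strongly they fired, without their locations)
    """
    z_pos = [1.0 if v > 0 else 0.0 for v in z]
    mags = sorted((v for v in z if v > 0), reverse=True)
    z_mag = mags + [0.0] * (k - len(mags))
    return z_pos, z_mag


# Toy latent pre-activations with d_z = 6 and k = 3.
h = [0.2, -1.0, 3.5, 0.0, 1.1, 0.7]
z = top_k_sparsify(h, k=3)
z_pos, z_mag = split_position_magnitude(z, k=3)
```

Under these assumptions, separate classifiers trained on `z_pos` and `z_mag` would indicate whether accent information is carried by which dimensions activate or by how strongly they activate, which is the distinction drawn in the abstract between phonetic-oriented and acoustic-oriented NACs.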
