Towards Interpretable Framework for Neural Audio Codecs via Sparse Autoencoders: A Case Study on Accent Information

Shih-Heng Wang, Tiantian Feng, Aditya Kommineni, Thanathai Lertpetchpun, Bowen Yi, Xuan Shi, Shrikanth Narayanan

Abstract

Neural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and deploying them in sensitive applications. Hence, we employ Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. In this work, we focus on a challenging paralinguistic attribute, accent, and propose a framework to quantify NAC interpretability. We evaluate four NAC models under 16 SAE configurations using a relative performance index. Our results show that DAC and SpeechTokenizer achieve the highest interpretability. We further reveal that acoustic-oriented NACs encode accent information primarily in activation magnitudes of sparse representations, whereas phonetic-oriented NACs rely more on activation positions, and that low-bitrate EnCodec variants show higher interpretability.

Paper Structure (17 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Sparse Autoencoder (SAE) framework for measuring the interpretability of NACs. (A) An utterance-level representation is extracted from a given speech input using a NAC. (B) An SAE is trained to learn the sparse representation $\mathbf{z}$, from which position ($\mathbf{z}_\text{pos}$) and magnitude ($\mathbf{z}_\text{mag}$) are derived. (C) Logistic regression models are trained for binary accent classification tasks; the utterance-level representation ($\mathbf{u}$) provides the reference score $\text{F1}_{ref}$, and the sparse representation yields $\text{F1}_{d_z,k}$.
  • Figure 2: Interpretability measured by $\Delta$F1 (%) under the US vs. UK setting. Each subplot corresponds to a different latent ratio $q$. The x-axis denotes sparsity level $s$, and higher $\Delta$F1 indicates better interpretability.
  • Figure 3: Interpretability measured by $\Delta$F1 (%) under the US vs. Non-US-UK setting. Each subplot corresponds to a different latent ratio $q$. The x-axis denotes sparsity level $s$, and higher $\Delta$F1 indicates better interpretability.
  • Figure 4: Activation density distribution across ordered latent dimensions on the test set. Dimensions in $\mathbf{z}$ are divided into 16 sequential groups; the x-axis shows group index and the y-axis shows activation density.
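
The position/magnitude split described in the Figure 1 caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the sparse code is produced by a ReLU followed by a top-$k$ selection, $\mathbf{z}_\text{pos}$ is taken to be a binary mask of which latent dimensions are active, and $\mathbf{z}_\text{mag}$ is taken to be the active values sorted in descending order so that positional information is discarded. The function names `topk_sparsify` and `split_position_magnitude` are hypothetical.

```python
import numpy as np


def topk_sparsify(h, k):
    """Assumed sparsifier: apply ReLU, then keep the k largest entries per row."""
    h = np.maximum(np.asarray(h, dtype=float), 0.0)
    # Column indices of the k largest values in each row (order within the k is arbitrary).
    idx = np.argpartition(h, -k, axis=1)[:, -k:]
    rows = np.arange(h.shape[0])[:, None]
    z = np.zeros_like(h)
    z[rows, idx] = h[rows, idx]
    return z


def split_position_magnitude(z, k):
    """Assumed split of a non-negative sparse code z into position and magnitude views."""
    # Position: which latent dimensions fire, ignoring how strongly.
    z_pos = (z != 0).astype(float)
    # Magnitude: the k largest values sorted in descending order, ignoring where they sit.
    z_mag = -np.sort(-z, axis=1)[:, :k]
    return z_pos, z_mag


# Toy pre-activations for two utterances with four latent dimensions.
h = np.array([[0.1, 2.0, -1.0, 0.5],
              [3.0, 0.0, 1.0, 0.2]])
z = topk_sparsify(h, k=2)
z_pos, z_mag = split_position_magnitude(z, k=2)
```

Under these assumptions, a classifier trained on `z_pos` sees only which latents are active, while one trained on `z_mag` sees only how strong the active latents are, which is one way to separate the two sources of accent information contrasted in the abstract.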