Probing neural audio codecs for distinctions among English nuclear tunes

Juan Pablo Vigneaux; Jennifer Cole

Abstract

State-of-the-art spoken dialogue models (Défossez et al. 2024; Schalkwyk et al. 2025) use neural audio codecs to "tokenize" audio signals into a lower-frequency stream of vectorial latent representations, each quantized using a hierarchy of vector codebooks. A transformer layer allows these representations to reflect some time- and context-dependent patterns. We train probes on labeled audio data from Cole et al. (2023) to test whether the pitch trajectories that characterize English phrase-final (nuclear) intonational tunes are among these patterns. Results: Linear probes trained on the unquantized latents or some of the associated codewords yield above-chance accuracy in distinguishing eight phonologically specified nuclear tunes with monotonal pitch accents (top average test accuracy (TATA): 0.31) and the five clusters of these tunes that are robust in human speech production and perception (TATA: 0.45). Greater accuracy (TATAs: 0.74-0.89) is attained for binary distinctions between classes of rising vs. falling tunes, respectively used for questions and assertions. Information about tunes is spread among all codebooks, which calls into question a distinction between 'semantic' and 'acoustic' codebooks found in the literature. Accuracies improve with nonlinear probes, but discrimination among the five clusters remains far from human performance, suggesting a fundamental limitation of current codecs.
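The probing setup described in the abstract can be illustrated with a minimal sketch. It assumes each utterance is represented by a sequence of codec latents that is averaged over time before classification, and it uses synthetic data in place of the labeled audio from Cole et al. (2023). The pooling choice, the dimensions, the optimizer settings, and every function name below are hypothetical. None of them is taken from the paper, whose actual aggregation and training details are not given in this excerpt. The probe is a plain multinomial logistic regression, compared against a ZeroR (majority-class) baseline.

```python
import numpy as np


def mean_pool(latents):
    """Average a (frames, dim) latent sequence over time into one (dim,) vector."""
    return latents.mean(axis=0)


def train_linear_probe(X, y, n_classes, lr=0.5, epochs=300):
    """Fit a linear probe (multinomial logistic regression) by gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets, shape (n, n_classes)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)  # softmax probabilities
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the class with the highest linear score for each row of X."""
    return (X @ W + b).argmax(axis=1)


def zero_r_accuracy(y_train, y_test):
    """ZeroR baseline: always predict the most frequent training class."""
    majority = np.bincount(y_train).argmax()
    return float((y_test == majority).mean())


def make_synthetic_data(n_per_class=60, n_classes=5, frames=20, dim=16, seed=0):
    """Hypothetical stand-in for codec latents: one Gaussian cluster per class."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_classes, dim))
    X, y = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            seq = centers[c] + rng.normal(size=(frames, dim))
            X.append(mean_pool(seq))
            y.append(c)
    X, y = np.array(X), np.array(y)
    perm = rng.permutation(len(y))  # shuffle before the train/test split
    return X[perm], y[perm]


X, y = make_synthetic_data()
split = int(0.8 * len(y))
W, b = train_linear_probe(X[:split], y[:split], n_classes=5)
probe_acc = float((predict(X[split:], W, b) == y[split:]).mean())
baseline_acc = zero_r_accuracy(y[:split], y[split:])
print(f"probe accuracy: {probe_acc:.2f}, ZeroR baseline: {baseline_acc:.2f}")
```

The synthetic clusters are far easier to separate than real nuclear tunes, so the accuracy printed here says nothing about the reported results. The sketch only shows the shape of the comparison: a linear map from pooled latents to tune labels, scored against a majority-class baseline.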

Paper Structure (15 sections, 2 equations, 3 figures, 1 table)

This paper contains 15 sections, 2 equations, 3 figures, 1 table.

Introduction
Motivation
Probes and interpretability
Nuclear tunes
Methods
Labeled audio
Neural encodings
Probes
Aggregation of the latent representations
Dimensionality reduction
Classifiers
Details of training
Results
Discussion
Conclusion and Perspectives

Figures (3)

Figure 1: Accuracy of optimal linear probes on their test sets. The colors indicate the kind of input used to train the linear probe. We also represent the ZeroR baseline.
Figure 2: Confusion matrix of the test set predictions generated by the linear probe trained on unquantized embeddings for the 5 class classification problem.
Figure 3: Accuracy of linear and nonlinear probes on unquantized embeddings.
