Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems
Quentin Raymondaud, Mickael Rouvier, Richard Dufour
TL;DR
The paper tackles the interpretability of neural-based acoustic models (AMs) used in automatic speech recognition by asking what information is encoded in the hidden layers and where it resides. It proposes a protocol that extracts intermediate representations from a fixed TDNN-F acoustic model and trains the same ECAPA-TDNN classifier on each layer to probe five diverse tasks: speaker verification, speaking rate, gender, acoustic environment, and sentiment/emotion. The findings show that information such as speaker identity and paralinguistic cues can be present in AMs but is not equally distributed across layers: lower layers tend to structure information, while higher layers prune content that is not helpful for phoneme recognition, and MFCC baselines perform better for speaker verification in some cases. The work provides a practical, reproducible method for dissecting neural AMs and informs the future design of more interpretable ASR systems.
Abstract
Deep learning architectures have made significant progress in terms of performance in many research areas. The automatic speech recognition (ASR) field has benefited from these scientific and technological advances, particularly in acoustic modeling, which now integrates deep neural network architectures. However, these performance gains have come with increased complexity regarding the information learned and conveyed through these black-box architectures. Following a large body of research on neural network interpretability, we propose in this article a protocol that aims to determine which information is encoded in an ASR acoustic model (AM) and where it is located. To do so, we propose to evaluate AM performance on a fixed set of tasks using intermediate representations (here, taken at different layer levels). Based on how performance varies across layers and targeted tasks, we can formulate hypotheses about which information is enhanced or perturbed at different stages of the architecture. Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection, and speech sentiment/emotion identification. The analysis shows that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition, such as emotion, sentiment, or speaker identity. The low-level hidden layers appear broadly useful for structuring information, while the upper ones tend to discard information that is not useful for phoneme recognition.
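
The layer-wise probing idea described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the frozen TDNN-F acoustic model is replaced here by a stack of fixed random ReLU layers, the ECAPA-TDNN probe by a nearest-centroid classifier, and the speech tasks by a synthetic binary label. All function names (`make_frozen_layers`, `forward_all`, `centroid_probe`, `probe_layers`, `toy_data`) are hypothetical and exist only for illustration. What the sketch keeps from the protocol is its structure: the model is never updated, the same probe is trained on each layer's output, and accuracy is compared across depths.

```python
import random


def make_frozen_layers(dims, seed=0):
    """Fixed random weight matrices standing in for a frozen acoustic model (assumption)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0, 1) for _ in range(dims[i])] for _ in range(dims[i + 1])]
        for i in range(len(dims) - 1)
    ]


def forward_all(layers, x):
    """Return the input followed by the ReLU activation of every layer."""
    reps = [x]
    for weights in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
        reps.append(x)
    return reps


def centroid_probe(train, test):
    """Nearest-centroid probe (a toy stand-in for ECAPA-TDNN): fit on train, score on test."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(vec):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )

    return sum(predict(vec) == label for vec, label in test) / len(test)


def probe_layers(layers, data, split=0.8):
    """Train one probe per depth on frozen representations; return accuracy per depth."""
    reps = [(forward_all(layers, x), y) for x, y in data]
    cut = int(len(reps) * split)
    results = []
    for depth in range(len(layers) + 1):  # depth 0 is the raw input features
        train = [(r[depth], y) for r, y in reps[:cut]]
        test = [(r[depth], y) for r, y in reps[cut:]]
        results.append(centroid_probe(train, test))
    return results


def toy_data(n=200, dim=8, seed=1):
    """Synthetic two-class data: the class shifts the mean of every feature."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append(([rng.gauss(1.0 if y else -1.0, 1.0) for _ in range(dim)], y))
    return data


layers = make_frozen_layers([8, 16, 16, 4])
results = probe_layers(layers, toy_data())
for depth, acc in enumerate(results):
    print(f"layer {depth}: probe accuracy = {acc:.2f}")
```

Comparing the per-depth accuracies is the point of the exercise: a drop at a given depth suggests that the layer discards information relevant to the probed task, while a gain suggests that it makes the information easier to extract. In this toy setting the numbers carry no meaning about real acoustic models.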
