Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech
Bruno Ferenc Šegedin, Gašper Beguš
TL;DR
The paper investigates how linguistically relevant information is encoded in the fully connected (FC) layer of ciwGAN and introduces two interpretability techniques for analyzing and manipulating FC representations. By using variable-specific weight matrices as inputs to the subsequent convolutional layers, and by manipulating the FC layer while holding later parameters constant, the study reveals both lexically grounded and sublexical structure within the FC layer, including vowel encoding that generalizes across lexical items. The findings argue against purely holistic, unanalyzable lexical representations: sublexical patterns are shared across lexical items and are partially compositional within the FC weights. These methods advance interpretability in speech-generating GANs and offer a framework for linking latent representations to phonological knowledge, with potential parallels to human speech processing.
Abstract
Interpretability work on the convolutional layers of CNNs has primarily focused on computer vision, but some studies also explore correspondences between the latent space and the output in the audio domain. However, it has not been thoroughly examined how acoustic and linguistic information is represented in the fully connected (FC) layer that bridges the latent space and convolutional layers. The current study presents the first exploration of how the FC layer of CNNs for speech synthesis encodes linguistically relevant information. We propose two techniques for exploration of the fully connected layer. In Experiment 1, we use weight matrices as inputs into convolutional layers. In Experiment 2, we manipulate the FC layer to explore how symbolic-like representations are encoded in CNNs. We leverage the fact that the FC layer outputs a feature map and that variable-specific weight matrices are temporally structured to (1) demonstrate how the distribution of learned weights varies between latent variables in systematic ways and (2) demonstrate how manipulating the FC layer while holding constant subsequent model parameters affects the output. We ultimately present an FC manipulation that can output a single segment. Using this technique, we show that lexically specific latent codes in generative CNNs (ciwGAN) have shared lexically invariant sublexical representations in the FC-layer weights, showing that ciwGAN encodes lexical information in a linguistically principled manner.
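The two techniques named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the random weights standing in for a trained model, the channels-first reshape order, and the helper names `fc_feature_map`, `variable_weight_matrix`, and `mask_time` are all assumptions made for illustration. The sketch only shows why the weights tied to one latent variable can be read as a temporally structured feature map, and what manipulating the FC output along the time axis might look like before the (unchanged) convolutional layers are applied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the real model's dimensions are not given in this section.
LATENT_DIM, CHANNELS, TIME = 8, 4, 16

# Random stand-ins for trained FC parameters (illustration only).
W = rng.normal(size=(LATENT_DIM, CHANNELS * TIME))
b = np.zeros(CHANNELS * TIME)


def fc_feature_map(z):
    """Apply the FC layer and reshape its output to (channels, time).

    The channels-first reshape order is an assumption of this sketch.
    """
    return (z @ W + b).reshape(CHANNELS, TIME)


def variable_weight_matrix(i):
    """View the weights attached to latent variable i as a (channels, time) matrix."""
    return W[i].reshape(CHANNELS, TIME)


def mask_time(feature_map, start, stop):
    """Keep only time steps [start, stop) of a feature map and zero the rest."""
    out = np.zeros_like(feature_map)
    out[:, start:stop] = feature_map[:, start:stop]
    return out


# Experiment-1-style view: with a one-hot latent code and zero bias, the FC
# output is exactly the weight matrix of the active variable, so that matrix
# can itself be passed to the convolutional layers as an input.
z = np.zeros(LATENT_DIM)
z[3] = 1.0
fmap = fc_feature_map(z)
w3 = variable_weight_matrix(3)

# Experiment-2-style manipulation: restrict the feature map to a short time
# window while leaving all downstream parameters untouched.
masked = mask_time(fmap, 4, 8)
```

In a real setting, `fmap`, `w3`, or `masked` would be passed through the trained convolutional stack to synthesize audio; the window chosen in `mask_time` is arbitrary here and is not the manipulation reported in the paper.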
