Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech

Bruno Ferenc Šegedin, Gašper Beguš

TL;DR

The paper investigates how linguistically relevant information is encoded in the fully-connected layer of ciwGAN, introducing two interpretability techniques to analyze and manipulate FC representations. By examining weight matrices and treating them as inputs to subsequent layers, the study reveals lexically grounded and sublexical structure within the FC layer, including vowel encoding that generalizes across lexical items. The findings argue against wholly holistic, indiscernible lexical representations, demonstrating that sublexical patterns are shared and partially compositional within the FC weights. These methods advance interpretability in speech-generating GANs and offer a framework for linking latent representations to phonological knowledge with potential parallels to human speech processing.

Abstract

Interpretability work on the convolutional layers of CNNs has primarily focused on computer vision, but some studies also explore correspondences between the latent space and the output in the audio domain. However, it has not been thoroughly examined how acoustic and linguistic information is represented in the fully connected (FC) layer that bridges the latent space and convolutional layers. The current study presents the first exploration of how the FC layer of CNNs for speech synthesis encodes linguistically relevant information. We propose two techniques for exploration of the fully connected layer. In Experiment 1, we use weight matrices as inputs into convolutional layers. In Experiment 2, we manipulate the FC layer to explore how symbolic-like representations are encoded in CNNs. We leverage the fact that the FC layer outputs a feature map and that variable-specific weight matrices are temporally structured to (1) demonstrate how the distribution of learned weights varies between latent variables in systematic ways and (2) demonstrate how manipulating the FC layer while holding constant subsequent model parameters affects the output. We ultimately present an FC manipulation that can output a single segment. Using this technique, we show that lexically specific latent codes in generative CNNs (ciwGAN) have shared lexically invariant sublexical representations in the FC-layer weights, showing that ciwGAN encodes lexical information in a linguistically principled manner.
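
The decomposition that both experiments rely on can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the feature-map shape `(T, C)`, the random weights, the zero bias, and the row-major reshape are hypothetical stand-ins for a trained ciwGAN generator; only the count of 100 latent variables comes from the paper. The sketch shows that a dense layer's pre-activation output equals the sum of variable-specific weight matrices, each scaled by the value of its latent variable, which is why a single variable's matrix can itself be treated as a feature map and passed to the convolutional layers.

```python
import numpy as np

rng = np.random.default_rng(0)

N_LATENT = 100   # latent variables (codes + noise), as stated in the paper
T, C = 16, 64    # ASSUMPTION: hypothetical time steps and channels of the FC feature map

# ASSUMPTION: random stand-ins for trained FC parameters.
W = rng.normal(size=(N_LATENT, T * C))
b = np.zeros(T * C)

def fc_pre_relu(latent):
    """Ordinary dense-layer output, reshaped to a (time, channel) feature map."""
    return (latent @ W + b).reshape(T, C)

def variable_weight_matrix(i):
    """FC weights belonging to latent variable i, viewed as a (T, C) matrix."""
    return W[i].reshape(T, C)

def fc_as_weighted_sum(latent):
    """Same output, written as a sum of variable-specific matrices scaled by each value."""
    total = b.reshape(T, C).copy()
    for i, value in enumerate(latent):
        total += value * variable_weight_matrix(i)
    return total

latent = rng.uniform(-1.0, 1.0, size=N_LATENT)
same = np.allclose(fc_pre_relu(latent), fc_as_weighted_sum(latent))

# With a zero bias, a one-hot latent vector reproduces one variable's matrix exactly.
one_hot = np.zeros(N_LATENT)
one_hot[3] = 1.0
one_hot_matches = np.allclose(fc_pre_relu(one_hot), variable_weight_matrix(3))
```

In a real model the bias would generally be nonzero, so feeding a bare weight matrix to the convolutional layers is a deliberate manipulation of the FC output rather than something the unmodified generator produces for any latent input.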

Paper Structure (23 sections, 3 equations, 17 figures)

Figures (17)

  • Figure 1: The model architecture of ciwGAN. The diagram is based on begus22Interspeech. Throughout this paper, we refer to the axis represented vertically in the generator diagram as the "time-axis".
  • Figure 2: A schematic representation showing that the output of the FC-layer (before ReLU activation) can be represented as the sum of 100 weight matrices each scaled by a particular random variable value.
  • Figure 3: Average absolute weights of weight matrices for every variable in the latent space. The first 9 points are the latent codes (c) while the rest of the points are the random noise variables (z).
  • Figure 4: Average of absolute weight values along channel length. The blue curves represent the weights of uniformly distributed z-variables, while the colored curves represent latent codes. This illustrates that weights are concentrated in areas along the time-axis where the lexical item is situated in the training data.
  • Figure 5: Output waveforms, each derived from passing variable-specific weight matrices as inputs into the convolutional layers. Each waveform unambiguously matches one of the nine lexical items that the network was trained on. For compactness, only the first 11000 samples of each 16340-sample output are shown.
  • ...and 12 more figures
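
The summary statistics described in the captions of Figures 3 and 4 could be computed roughly as follows. This is a hedged sketch, not the authors' analysis code: the weights are random placeholders, the `(T, C)` shape is hypothetical, and reading "along channel length" as an average over channels at each time step is an assumption. The split into 9 latent codes followed by noise variables follows the Figure 3 caption.

```python
import numpy as np

rng = np.random.default_rng(1)

N_LATENT, N_CODES = 100, 9   # 9 latent codes (c) followed by noise variables (z)
T, C = 16, 64                # ASSUMPTION: hypothetical feature-map shape

# ASSUMPTION: random placeholder for trained FC weights, one (T, C) matrix per variable.
W = rng.normal(size=(N_LATENT, T, C))

# Figure 3 style: one average absolute weight per latent variable.
mean_abs_per_variable = np.abs(W).mean(axis=(1, 2))   # shape (100,)

# Figure 4 style: average absolute weight at each time step, per variable.
mean_abs_over_time = np.abs(W).mean(axis=2)           # shape (100, T)

code_profiles = mean_abs_over_time[:N_CODES]    # curves for the latent codes
noise_profiles = mean_abs_over_time[N_CODES:]   # curves for the noise variables
```

With trained weights, plotting each row of `code_profiles` against the time index would be one way to check where along the time-axis a code's weights are concentrated.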