Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks

Gašper Beguš, Thomas Lu, Zili Wang

TL;DR

Spontaneous concatenation is introduced: a phenomenon where convolutional neural networks trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated without ever accessing data with multiple words in the input.

Abstract

Computational models of syntax are predominantly text-based. Here we propose that the most basic first step in the evolution of syntax can be modeled directly from raw speech in a fully unsupervised way. We focus on one of the most ubiquitous and elementary suboperations of syntax -- concatenation. We introduce spontaneous concatenation: a phenomenon where convolutional neural networks (CNNs) trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated without ever accessing data with multiple words in the input. We replicate this finding in several independently trained models with different hyperparameters and training data. Additionally, networks trained on two words learn to embed words into novel unobserved word combinations. We also show that the concatenated outputs contain precursors to compositionality. To our knowledge, this is a previously unreported property of CNNs trained in the ciwGAN/fiwGAN setting on raw speech, and it has implications both for our understanding of how these architectures learn and for modeling syntax and its evolution in the brain from raw acoustic inputs. We also propose a potential neural mechanism called disinhibition that outlines a possible neural pathway towards concatenation and compositionality and suggests that our modeling is useful for generating testable predictions for biological and artificial neural processing of speech.

Paper Structure

This paper contains 18 sections, 4 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: The architecture of ciwGAN used in the two-second one-word experiment.
  • Figure 2: The suit year output (left) and the rag year output (right) from the one-second one-word model. All spectrograms are created in Praat [boersma15].
  • Figure 3: The three-word concatenated output box under water. Independently, the second word (under) is somewhat difficult to analyze, but given only five training words, it is clearly the closest output to under.
  • Figure 4: The average number of words generated by the two one-word two-second models is plotted as a function of the first 3 bits of the latent code. The remaining two bits of the latent code are maintained at a value of 0. As in the other trials, each bitstring was tested with 10 sets of latent space values.
  • Figure 5: (top) Predicted values of the logistic regression mixed effects model with the proportion of two-word outputs (one-word output = failure, two-word output or more = a success) as the dependent variable and sum of bits 1--5 as the predictor with 95% confidence limits (Experiment 2 in Table \ref{tab:parameters}). Outputs with no transcribed words were removed. The random effect structure involves random intercepts for each of the 10 unique random latent space samples $z$ as well as a random slope for sum of bits. The values were obtained with the effects package [effects1, effects2]. (bottom) Predicted values for one (bit1 * bit2 * bit5) of the many interactions of the logistic regression mixed effects model with the proportion of two-word outputs (one-word output = failure, two-word output = a success) as the dependent variable and individual bits with all interactions (including the five-way interaction) as the predictors with 95% confidence limits. The random effect structure involves only the random intercept for each of the 10 unique random latent space samples $z$. The values were obtained with the effects package [effects1, effects2]. Model estimates are given in Table \ref{estimates}. Not all interactions show the same effect.
  • ...and 10 more figures
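
The latent-code sweep described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the names `build_sweep` and `mean_word_count`, the random-vector size `Z_DIM = 95`, the uniform range for $z$, and the choice to reuse the same 10 samples for every bitstring are all assumptions made here for the example. The word counts themselves would come from transcribing generated audio, which is not modeled.

```python
import itertools
import random

N_CODE_BITS = 5    # length of the latent code, per the Figure 4 caption
N_SWEPT_BITS = 3   # bits that are varied; the remaining bits stay at 0
N_Z_SAMPLES = 10   # latent samples tested per bitstring
Z_DIM = 95         # assumption: size of the random part of the latent vector


def build_sweep(seed=0):
    """Return {code: [generator input, ...]} for every 3-bit prefix.

    Hypothetical helper: each generator input is the 5-bit code
    followed by one random latent sample z.
    """
    rng = random.Random(seed)
    # Assumption: the same 10 z samples are reused for every bitstring.
    z_samples = [
        [rng.uniform(-1.0, 1.0) for _ in range(Z_DIM)]
        for _ in range(N_Z_SAMPLES)
    ]
    sweep = {}
    for prefix in itertools.product((0, 1), repeat=N_SWEPT_BITS):
        # Pad the swept bits with zeros for the bits held constant.
        code = prefix + (0,) * (N_CODE_BITS - N_SWEPT_BITS)
        sweep[code] = [list(code) + z for z in z_samples]
    return sweep


def mean_word_count(counts):
    """Average number of transcribed words over one bitstring's outputs."""
    return sum(counts) / len(counts)


sweep = build_sweep()
```

Each key of `sweep` is one of the eight tested bitstrings, and `mean_word_count` would be applied to the per-output word counts for that key to obtain one plotted value.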