Multimodal Input Aids a Bayesian Model of Phonetic Learning

Sophia Zhi, Roger P. Levy, Stephan C. Meylan

TL;DR

The paper investigates whether visual mouth movements facilitate phonetic learning beyond audio alone by embedding audiovisual information into a Bayesian clustering framework. It extends prior unimodal approaches with synthetic deepfake mouth videos and a Dirichlet Process Gaussian Mixture Model to form multimodal token representations learned from audio-visual windows, evaluated via ABX discrimination. Key findings show audiovisual training improves phoneme discrimination, including when tested on audio-only data, and provides substantial gains in noisy conditions, suggesting lasting benefits to acoustic representations. The work demonstrates the feasibility of using visual cues to support phonetic learning and offers a computational account of how children might leverage visual information during speech acquisition.

Abstract

One of the many tasks facing the typically developing child language learner is learning to discriminate between the distinctive sounds that make up words in their native language. Here we investigate whether multimodal information--specifically adult speech coupled with video frames of speakers' faces--benefits a computational model of phonetic learning. We introduce a method for creating high-quality synthetic videos of speakers' faces for an existing audio corpus. Our learning model, when both trained and tested on audiovisual inputs, achieves up to an 8.1% relative improvement on a phoneme discrimination battery compared to a model trained and tested on audio-only input. It also outperforms the audio model by up to 3.9% when both are tested on audio-only data, suggesting that visual information facilitates the acquisition of acoustic distinctions. Visual information is especially beneficial in noisy audio environments, where an audiovisual model closes 67% of the loss in discrimination performance of the audio model in noise relative to a non-noisy environment. These results demonstrate that visual information benefits an ideal learner and illustrate some of the ways that children might be able to leverage visual cues when learning to discriminate speech sounds.
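
One way to read the "closes 67% of the loss" figure is as a ratio of two performance gaps. The symbols below are assumptions introduced here for illustration and do not come from the paper: $S_{A}^{\text{clean}}$ and $S_{A}^{\text{noise}}$ stand for the audio model's discrimination score without and with noise, and $S_{AV}^{\text{noise}}$ for the audiovisual model's score in noise.

$$
\text{fraction of loss closed} = \frac{S_{AV}^{\text{noise}} - S_{A}^{\text{noise}}}{S_{A}^{\text{clean}} - S_{A}^{\text{noise}}} \approx 0.67
$$

Under this reading, a value of 0 would mean that vision gives no help in noise, and a value of 1 would mean that the audiovisual model in noise matches the audio model in clean conditions.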

Paper Structure (14 sections, 3 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Two hypotheses about what matters for the acquisition of phonetic categories: Under the first or "unimodal" hypothesis, children learn from and use the acoustic signal alone; under the second "multimodal" hypothesis, child learners also make use of the visual information from the speaker's articulators.
  • Figure 2: Overview of dataset creation, derivation of acoustic and visual embeddings, clustering, and evaluation. We vary the modalities used at both train and test time and then test the resulting models' phoneme discrimination performance.
  • Figure 3: Video feature extraction. Each frame is transformed to a low-dimensional representation, and a set of video features that captures both static and dynamic information is derived by concatenating (1) the middle frame's representation, (2) the difference between the middle frame's representation and the previous frame's representation, and (3) the difference between the next frame's representation and the middle frame's representation.
  • Figure 4: ABX similarity calculation for evaluation. We compare audio/video recordings of potentially different durations using dynamic time warping to match each window in one recording to a window in the other, then calculate the overall dissimilarity by averaging the divergence between each pair of matched windows. $w$ indicates window and $c$ indicates cluster assignment.
  • Figure 5: Overall ABX phoneme discrimination score for all combinations of train/test modalities. Results are averaged over 10 fitted DPGMMs and error bars mark 95% confidence intervals.
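
The captions for Figures 3 and 4 describe two concrete computations, which can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the first and last frames are simply dropped because they lack a neighbor, each window is assumed to be represented as a probability vector over clusters, and symmetric KL divergence stands in for the unspecified "divergence" in the Figure 4 caption.

```python
import numpy as np


def video_features(frames):
    """Concatenate static and dynamic information per frame (cf. Figure 3).

    frames: (T, d) array of per-frame low-dimensional representations.
    Returns a (T - 2, 3 * d) array holding, for each interior frame,
    [middle, middle - previous, next - middle].
    Assumption: edge frames are dropped rather than padded.
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 2 or frames.shape[0] < 3:
        raise ValueError("need a (T, d) array with at least 3 frames")
    mid = frames[1:-1]
    return np.concatenate([mid, mid - frames[:-2], frames[2:] - mid], axis=1)


def symmetric_kl(p, q, eps=1e-10):
    """Symmetric KL divergence between two cluster posteriors (an assumption)."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def dtw_dissimilarity(a, b, divergence=symmetric_kl):
    """Average divergence along the best DTW alignment (cf. Figure 4).

    a, b: sequences of per-window vectors, possibly of different lengths.
    """
    n, m = len(a), len(b)
    cost = np.array([[divergence(x, y) for y in b] for x in a])
    acc = np.full((n + 1, m + 1), np.inf)      # accumulated path cost
    steps = np.zeros((n + 1, m + 1), dtype=int)  # matched pairs on that path
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (acc[i - 1, j - 1], steps[i - 1, j - 1]),
                (acc[i - 1, j], steps[i - 1, j]),
                (acc[i, j - 1], steps[i, j - 1]),
            ]
            best_acc, best_steps = min(candidates, key=lambda t: t[0])
            acc[i, j] = cost[i - 1, j - 1] + best_acc
            steps[i, j] = best_steps + 1
    return acc[n, m] / steps[n, m]


def abx_correct(a, b, x):
    """True if X (same category as A) is closer to A than to B."""
    return dtw_dissimilarity(a, x) < dtw_dissimilarity(b, x)
```

An overall ABX score in this style would be the proportion of (A, B, X) triples for which `abx_correct` returns true; how triples are sampled and aggregated is not specified in the captions and is left out here.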