An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits

Kai Li; Fenghua Xie; Hang Chen; Kexin Yuan; Xiaolin Hu

TL;DR

Experiments show that CTCNet remarkably outperforms existing AVSS methods while using considerably fewer parameters, suggesting that mimicking the anatomical connectome of the mammalian brain has great potential for advancing deep neural networks.

Abstract

Approaches that incorporate visual inputs have laid the foundation for recent progress in speech separation. However, how best to use auditory and visual inputs together remains an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, the CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits this fused information back to the auditory and visual subnetworks, and the above process is repeated several times. Experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks. The project repository is available at https://github.com/JusperLee/CTCNet.
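The three-step loop described in the abstract (unimodal processing, thalamic fusion, feedback) can be sketched as plain control flow. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `run_ctc_style_cycles`, the toy subnetworks, and the use of equal-length float lists as features are all hypothetical stand-ins. The parameters `n_fusion_cycles` and `m_audio_cycles` correspond to the $n$ and $m$ mentioned in the Figure 1 caption.

```python
from typing import Callable, List, Tuple

Vector = List[float]
UnimodalNet = Callable[[Vector], Vector]
FusionNet = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def run_ctc_style_cycles(
    audio: Vector,
    visual: Vector,
    audio_net: UnimodalNet,
    visual_net: UnimodalNet,
    thalamic_net: FusionNet,
    n_fusion_cycles: int,
    m_audio_cycles: int = 0,
) -> Tuple[Vector, Vector]:
    """Hypothetical schedule: n fusion cycles, then m audio-only cycles."""
    for _ in range(n_fusion_cycles):
        a = audio_net(audio)      # auditory subnetwork (bottom-up)
        v = visual_net(visual)    # visual subnetwork (bottom-up)
        # The thalamic subnetwork fuses both modalities and sends the
        # fused information back to each unimodal pathway.
        audio, visual = thalamic_net(a, v)
    for _ in range(m_audio_cycles):
        audio = audio_net(audio)  # extra auditory cycles after AV fusion
    return audio, visual


# Toy stand-ins (assumptions for illustration only).
def toy_unimodal_net(x: Vector) -> Vector:
    return [0.5 * xi for xi in x]


def toy_thalamic_net(a: Vector, v: Vector) -> Tuple[Vector, Vector]:
    fused = [ai + vi for ai, vi in zip(a, v)]
    audio_out = [ai + fi for ai, fi in zip(a, fused)]
    visual_out = [vi + fi for vi, fi in zip(v, fused)]
    return audio_out, visual_out


audio_out, visual_out = run_ctc_style_cycles(
    [1.0, 2.0], [0.0, 0.0],
    toy_unimodal_net, toy_unimodal_net, toy_thalamic_net,
    n_fusion_cycles=1, m_audio_cycles=1,
)
```

In a real model the subnetworks would be learned modules operating on tensors with different temporal resolutions; the sketch only shows the order in which information flows.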

Paper Structure (23 sections, 9 equations, 5 figures, 9 tables)

This paper contains 23 sections, 9 equations, 5 figures, 9 tables.

Introduction
Related work
The CTCNet-based AVSS model
Overall pipeline
The CTCNet
Control models
Lip-reading pre-training
Loss function for training
Experiments
Datasets
Implementation details
Evaluation metrics
Results
Hyperparameter setting
...and 8 more sections

Figures (5)

Figure 1: Overview of multimodal information flow in the CTCNet model. (A) The multimodal information processing between the auditory cortex and thalamus of the rodent brain is illustrated on the coronal section. The auditory and visual thalamus and cortices and the multimodal thalamus are labeled using different colors. The transmission and integration pathways for different sensory information are delineated on the right side of the figure. The red, blue, and yellow arrows indicate bottom-up, top-down, and lateral connections, respectively. The higher-order sensory thalamic nuclei are involved in intercortical information transmission and integrate auditory and visual inputs. (B) The multimodal information process of the cortico-thalamo-cortical transthalamic connectivity patterns (left) and CTCNet structure (right). In this panel, $\mathbf{E}$ and $\mathbf{K}_i$ denote the audio mixture embedding and the visual features from the auditory and visual modules, respectively; $\mathbf{A}_{i,t}$ and $\mathbf{A'}_{i,t}$ denote auditory features before and after the multi-modal fusion process, respectively; $\mathbf{V}_{i,t}$ and $\mathbf{V'}_{i,t}$ denote visual features before and after the multi-modal fusion process, respectively. (C) The recurrent process of AV fusion in the CTCNet model over time. The red, blue, and yellow arrows indicate bottom-up, top-down, and lateral connections, respectively. In the figure, $n$ represents the number of cycles in the auditory, visual and thalamic subnetworks, and $m$ represents the number of extra cycles in the auditory subnetwork after AV fusion.
Figure 2: The pipeline of our AVSS network. Our network includes an auditory module, a visual module and an AV fusion module. It takes a video with a speech mixture as input and outputs the separated speech of different speakers.
Figure 3: Alternative models for AV fusion. (A) The DFTNet model was obtained by removing cross-connections between adjacent layers in the CTCNet model, which corresponds to a CTC diagram in the brain without top-down connections along unimodal cortical pathways. (B) The CCNet model was obtained by directly connecting the auditory and visual subnetworks in CTCNet, which was inspired by the cross-connections between the auditory and visual pathways in the brain, where $\mathbf{A}_{i,t,d}$ and $\mathbf{A'}_{i,t,d}$ denote auditory features from different temporal resolutions before and after the multi-modal fusion process, respectively; $\mathbf{V}_{i,t,d}$ and $\mathbf{V'}_{i,t,d}$ denote visual features from different temporal resolutions before and after the multi-modal fusion process, respectively. (C) The CACNet model was obtained by moving the AV fusion subnetwork to the top layers of the auditory and visual subnetworks in CTCNet. (D) The recurrent process of AV fusion in CCNet over time, where the structure of the auditory subnetwork with $m$ cycles is the same as that of the auditory subnetwork of CTCNet, with inputs and outputs $\mathbf{A}_{i,t}$ and $\mathbf{A'}_{i,t}$, respectively. The notations are the same as those in Fig. 1. The yellow arrows in (D) indicate that only visual and auditory features at the same levels in the two subnetworks are fused.
Figure 4: Visualization of speech separation results obtained by the CTCNet. (A) and (B) Two example target speech spectrograms. The corresponding textual content was labeled in the corresponding positions above the spectrograms. (C) The mixture speech spectrogram. (D) The mixture contour. (E) and (F) The two restored speech spectrograms. (G) Scatter plot of correlation for all test examples (N=3,000). Each point indicates the correlation between the spectrograms of the separated speeches and the target speeches. The point "$\times$" indicates the correlation for the example of the displayed spectrograms. (H) Visualization of the speaker identity of the target speeches (left) and separated speeches (right) by the t-SNE method. Each point corresponds to the speaker's embedding from the x-vector.
Figure 5: Visualization of the speaker identity of the separated speech output by different control models using the t-SNE method. The same convention is used as in Fig. 4H.
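The Figure 4(G) caption describes a correlation between the spectrograms of separated and target speech. The exact correlation measure is not stated in this summary, so the sketch below assumes a Pearson correlation over flattened magnitude spectrograms. The function name `spectrogram_correlation` and the nested-list input format are hypothetical choices made for illustration.

```python
import math
from typing import List

Spectrogram = List[List[float]]  # rows = frequency bins, columns = time frames


def spectrogram_correlation(estimate: Spectrogram, target: Spectrogram) -> float:
    """Pearson correlation between two equally shaped spectrograms (assumed metric)."""
    x = [value for row in estimate for value in row]
    y = [value for row in target for value in row]
    if len(x) != len(y) or not x:
        raise ValueError("spectrograms must be non-empty and have the same size")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant spectrogram")
    return cov / denom


# Toy example: the estimate is an affine rescaling of the target,
# so the two are perfectly correlated.
target = [[1.0, 2.0], [3.0, 4.0]]
estimate = [[3.0, 5.0], [7.0, 9.0]]
r = spectrogram_correlation(estimate, target)
```

A value near 1 indicates that the separated spectrogram follows the target closely up to scale and offset. The measure does not capture phase or perceptual quality.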
