Vision-Integrated High-Quality Neural Speech Coding

Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling

TL;DR

The paper addresses the limitations of unimodal neural speech codecs by incorporating visual lip information to boost quality and robustness without increasing bitrate. It introduces VNSC, a three-module architecture consisting of a speech coding module (MDCTCodec-based), an image analysis-synthesis module, and a feature fusion module. The fusion module uses explicit integration when visual information is available at inference (VA) and implicit distillation when it is unavailable (VUA). Training optimizes audio-visual losses, including an image reconstruction loss $\mathcal{L}_I$ and, for the VUA case, a distillation loss $\mathcal{L}_D$. On the TaL dataset, VNSC achieves higher PESQ, STOI, ViSQOL, and SSNR than the baselines. The results show improved decoded speech quality and noise robustness at the same bitrate, making VNSC practical for vision-assisted speech coding; future work targets lower latency and streaming support.

Abstract

This paper proposes a novel vision-integrated neural speech codec (VNSC), which aims to enhance speech coding quality by leveraging visual modality information. In VNSC, the image analysis-synthesis module extracts visual features from lip images, while the feature fusion module facilitates interaction between the image analysis-synthesis module and the speech coding module, transmitting visual information to assist the speech coding process. Depending on whether visual information is available during the inference stage, the feature fusion module integrates visual features into the speech coding module using either explicit integration or implicit distillation strategies. Experimental results confirm that integrating visual information effectively improves the quality of the decoded speech and enhances the noise robustness of the neural speech codec, without increasing the bitrate.
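
The difference between the two strategies can be illustrated with a toy sketch. It is not the authors' implementation: the additive fusion rule, the `weight` parameter, the mean-squared-error form of the distillation loss, and the names `fuse_explicit` and `distillation_loss` are all assumptions made for illustration. The actual feature fusion module is convolutional (see Figure 2), and the summary does not give the exact definition of $\mathcal{L}_D$.

```python
from typing import List

Vector = List[float]


def fuse_explicit(speech_feat: Vector, visual_feat: Vector, weight: float = 0.5) -> Vector:
    """VA case (hypothetical): inject visual features into the speech features.

    A weighted sum stands in for the paper's learned fusion module.
    """
    if len(speech_feat) != len(visual_feat):
        raise ValueError("feature vectors must have the same length")
    return [s + weight * v for s, v in zip(speech_feat, visual_feat)]


def distillation_loss(student_feat: Vector, teacher_feat: Vector) -> float:
    """VUA case (hypothetical): mean squared error between two feature vectors.

    The audio-only student is pushed toward the vision-assisted teacher,
    so no visual input is needed at inference time.
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)


speech = [1.0, 2.0]   # toy audio-only features
visual = [0.5, -1.0]  # toy lip-image features

teacher = fuse_explicit(speech, visual)     # vision-assisted features
loss = distillation_loss(speech, teacher)   # gap the student must close
print(teacher, loss)
```

In this sketch the visual features never add to what is transmitted: in the VA case they only modify intermediate features, and in the VUA case they only act as a training target. This mirrors the claim that quality improves without increasing the bitrate.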

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An overview of the proposed VNSC.
  • Figure 2: Structural details of the image analysis-synthesis module and the feature fusion module in VNSC. Here, Conv3D, TransConv3D, and Conv1D represent 3D convolution, transposed 3D convolution, and 1D convolution operations, respectively. For simplicity, only the MCNX v2 blocks of the speech encoder in the speech coding module are depicted.