VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Chunyu Qiang; Wang Geng; Yi Zhao; Ruibo Fu; Tao Wang; Cheng Gong; Tianrui Wang; Qiuyu Liu; Jiangyan Yi; Zhengqi Wen; Chen Zhang; Hao Che; Longbiao Wang; Jianwu Dang; Jianhua Tao

TL;DR

VQ-CTAP learns frame-level cross-modal representations between text and speech to support text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR). It introduces a cross-modal aligned sequence transcoder that produces a 25 Hz discrete speech embedding via vector quantization, trained with a token-acoustic contrastive loss and a semantic-transfer-wise paralinguistic consistency loss under a stepping optimization strategy. The pre-trained model can be applied to VC and ASR without fine-tuning, and a sequence-aware semantic connector links frozen pre-trained modules for TTS. Encoding 24 kHz waveforms at 25 Hz gives a 960-fold reduction in sampling rate, and unlabeled speech is used in the consistency loss to improve generalization to unseen data. The resulting representation emphasizes semantic content while de-emphasizing paralinguistic information, which allows flexible plug-and-play use in downstream tasks.
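The 960-fold figure follows directly from the two rates quoted above. The short check below is only an arithmetic illustration; the function names are hypothetical and not taken from the paper.

```python
# Rates quoted in the text: 24 kHz input waveform, 25 Hz embedding rate.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 25


def compression_factor(sample_rate_hz: int, frame_rate_hz: int) -> float:
    """Ratio of input samples per second to output frames per second."""
    return sample_rate_hz / frame_rate_hz


def frames_for_duration(seconds: float, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of whole frames that cover a clip of the given length."""
    return int(seconds * frame_rate_hz)


factor = compression_factor(SAMPLE_RATE_HZ, FRAME_RATE_HZ)
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ

print(factor)                     # 960.0
print(samples_per_frame)          # 960
print(frames_for_duration(10.0))  # 250
```

In other words, each 25 Hz frame summarizes 960 waveform samples, so a 10-second clip is represented by 250 frames.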

Abstract

Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses a cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
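To make the idea of frame-level contrastive alignment concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss over paired phoneme and speech frame embeddings. It is an assumption-laden illustration, not the paper's exact formulation: the function names, the cosine similarity, and the temperature value of 0.07 are all hypothetical choices, and the sketch uses plain Python lists rather than a tensor library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(P, S, temperature=0.07):
    """Average of text-to-speech and speech-to-text cross-entropy.

    P[i] and S[i] are assumed to be the phoneme and speech embeddings
    of the same frame; every other pairing counts as a negative.
    """
    P = [l2_normalize(p) for p in P]
    S = [l2_normalize(s) for s in S]
    logits = [[dot(p, s) / temperature for s in S] for p in P]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy frames: aligned pairs should score a much lower loss than shuffled ones.
phoneme_frames = [[1.0, 0.0], [0.0, 1.0]]
speech_aligned = [[0.9, 0.1], [0.1, 0.9]]
speech_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(phoneme_frames, speech_aligned)
loss_shuffled = symmetric_contrastive_loss(phoneme_frames, speech_shuffled)
print(loss_aligned < loss_shuffled)  # True
```

Minimizing a loss of this kind pulls each phoneme embedding toward the speech embedding of the same frame and pushes it away from the others in the batch, which is the general mechanism behind learning frame-level (dis)similarity between text and speech.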

Paper Structure (28 sections, 12 equations, 7 figures, 5 tables)

Introduction
Related work
Speech Representation
Contrastive Learning
TTS, VC, and ASR tasks
Method
Cross-Modal Aligned Sequence Transcoder
Semantic-Transfer-Wise Paralinguistic Consistency Loss
Token-Acoustic Contrastive Loss
Stepping Optimization Strategy
Plug-and-Play for Downstream Tasks
TTS Pipeline
Sequence-Aware Semantic Connector
VC Pipeline
ASR Pipeline
...and 13 more sections

Figures (7)

Figure 1: TTS, VC and ASR systems based on semantic coding
Figure 2: The main body of VQ-CTAP is the cross-modal aligned sequence transcoder, which jointly trains three encoders and two decoders. It extracts the phoneme embedding $(P)$ from the target text, the target speech embedding $(S)$ and the prompt paralinguistic embedding $(G)$ from the target speech, and the random speech embedding $(R)$ from the random speech. $(P)$ and $(S)$ are used to construct contrastive token-acoustic pre-training, which learns frame-level (dis)similarity between a batch of speech and text pairs. Additionally, two decoders are adapted for downstream tasks such as TTS, ASR, and VC. To give the representation its semantic-paralinguistic decoupling ability, unlabeled random speech is used to calculate the semantic-transfer-wise paralinguistic consistency loss.
Figure 3: The pre-trained VQ-CTAP is used for downstream TTS, VC and ASR tasks.
Figure 4: The architecture of sequence-aware semantic connector
Figure 5: t-SNE plot of phoneme and speech embeddings for 20 speakers. The $\bigstar$ marks $P$, and the other shapes mark $S$ for different speakers; each color identifies the position of a corresponding $P$ and $S$ pair. For example, the red and orange points are the same phoneme "k" at different positions, yet their corresponding $P$ and $S$ embeddings are not entangled.
...and 2 more figures
