
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao

TL;DR

VQ-CTAP learns frame-level cross-modal representations of text and speech to support TTS, VC, and ASR. Its core is a cross-modal aligned sequence transcoder that produces a discrete speech embedding at 25 Hz via vector quantization, trained with a token-acoustic contrastive loss and a semantic-transfer-wise paralinguistic consistency loss under a stepping optimization strategy. The pre-trained model can be applied to VC and ASR without fine-tuning, and a sequence-aware semantic connector links frozen pre-trained modules for TTS in a plug-and-play fashion. Encoding 24 kHz waveforms at 25 Hz amounts to a 960-fold reduction in sampling rate. The representation emphasizes semantic content while de-emphasizing paralinguistic information, and the consistency loss, computed with unlabeled random speech, is intended to improve generalization to unseen data.
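
To make the "discrete speech embedding via vector quantization" step concrete, here is a minimal sketch of a generic nearest-neighbour codebook lookup. It is not taken from the paper: the function names, the codebook size, the embedding dimension, and the toy values are all assumptions chosen for illustration, and the actual VQ-CTAP quantizer may differ.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each frame to the index and vector of its nearest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Hypothetical toy data: a 3-entry codebook and 4 two-dimensional frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.7, 0.6]]

indices, quantized = quantize(frames, codebook)
print(indices)  # one discrete token index per frame
```

Each continuous frame is replaced by an integer index, so at 25 frames per second the speech signal becomes a sequence of 25 discrete tokens per second.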

Abstract

Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses a cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
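
The 960-fold figure follows directly from the two rates stated in the abstract. The short sketch below works through the arithmetic; the helper names and the 10-second example duration are illustrative assumptions, not values from the paper.

```python
def compression_factor(sample_rate_hz: int, frame_rate_hz: int) -> float:
    """Ratio between the waveform sampling rate and the embedding frame rate."""
    return sample_rate_hz / frame_rate_hz


def num_frames(duration_s: float, frame_rate_hz: int) -> int:
    """Number of embedding frames for a clip of the given duration."""
    return int(duration_s * frame_rate_hz)


SAMPLE_RATE_HZ = 24_000  # input waveform rate stated in the abstract
FRAME_RATE_HZ = 25       # speech coding rate stated in the abstract

factor = compression_factor(SAMPLE_RATE_HZ, FRAME_RATE_HZ)
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ

# Hypothetical example: a 10-second utterance.
clip_samples = 10 * SAMPLE_RATE_HZ
clip_frames = num_frames(10, FRAME_RATE_HZ)

print(factor)             # 960.0
print(samples_per_frame)  # 960
print(clip_samples, clip_frames)  # 240000 250
```

Note that this ratio describes the reduction in sequence rate (samples per second versus frames per second); it is not a statement about bitrate.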

Paper Structure

This paper contains 28 sections, 12 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: TTS, VC and ASR systems based on semantic coding
  • Figure 2: The main body of VQ-CTAP is the cross-modal aligned sequence transcoder, which jointly trains three encoders and two decoders. It extracts the phoneme embedding $(P)$ from the target text, the target speech embedding $(S)$ and the prompt paralinguistic embedding $(G)$ from the target speech, and the random speech embedding $(R)$ from the random speech. $(P)$ and $(S)$ are used for contrastive token-acoustic pre-training, which learns frame-level (dis)similarity between a batch of speech and text pairs. Additionally, two decoders are adapted for downstream tasks such as TTS, ASR, and VC. So that the representation can decouple semantic from paralinguistic information, unlabeled random speech is used to calculate the Semantic-Transfer-wise Paralinguistic Consistency Loss.
  • Figure 3: The pre-trained VQ-CTAP is used for downstream TTS, VC and ASR tasks.
  • Figure 4: The architecture of sequence-aware semantic connector
  • Figure 5: t-SNE plot of phoneme/speech embeddings for 20 speakers. The $\bigstar$ represents $P$, and the other shapes represent $S$ for different speakers, with different colors indicating the positions of corresponding $P$ and $S$. For the red and orange instances of the phoneme "k", the phoneme is the same but the positions differ, and the corresponding $P$ and $S$ are not entangled.
  • ...and 2 more figures
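
The Figure 2 caption describes contrastive token-acoustic pre-training as learning frame-level (dis)similarity between phoneme embeddings $P$ and speech embeddings $S$. A minimal sketch of one common way to do this, a symmetric InfoNCE-style loss over aligned frames, is given below. This is a generic formulation and an assumption on our part: the exact loss used in VQ-CTAP, the temperature value, and the function names here are not taken from the paper.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diagonal(logits: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def frame_contrastive_loss(
    phoneme_frames: Sequence[Sequence[float]],
    speech_frames: Sequence[Sequence[float]],
    temperature: float = 0.1,  # hypothetical value, not from the paper
) -> float:
    """Symmetric contrastive loss: frame i of P should match frame i of S."""
    p = [l2_normalize(v) for v in phoneme_frames]
    s = [l2_normalize(v) for v in speech_frames]
    # Cosine similarity between every phoneme frame and every speech frame.
    logits = [
        [sum(a * b for a, b in zip(pi, sj)) / temperature for sj in s]
        for pi in p
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the text-to-speech and speech-to-text directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Hypothetical toy frames: aligned pairs should give a lower loss than shuffled ones.
P = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
S = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.1]]

aligned_loss = frame_contrastive_loss(P, S)
shuffled_loss = frame_contrastive_loss(P, list(reversed(S)))
print(aligned_loss < shuffled_loss)
```

Matching frames lie on the diagonal of the similarity matrix, so minimizing the loss pulls each phoneme frame toward its own speech frame and pushes it away from the other frames in the batch.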