SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec

Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao

TL;DR

SecoustiCodec addresses the challenge of building a low-bitrate, streaming speech codec that preserves semantic content while disentangling paralinguistic information. It does so through separate acoustic, semantic, and paralinguistic modeling, a semantic-only VAE+FSQ quantizer with high codebook utilization, and a cross-modal frame-level contrastive objective that aligns text and speech. An acoustic-constrained, multi-stage optimization strategy supports stable convergence and streaming capability. Empirically, it reports state-of-the-art PESQ of 1.77 at 0.27 kbps and 2.58 at 1 kbps, and its ablations indicate strong semantic-paralinguistic disentanglement. The work also provides an open-source demo, code, and model weights.

Abstract

Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of tokens while maintaining high codebook utilization. A semantic disentanglement method based on contrastive learning is proposed, which aligns text and speech in a joint multimodal frame-level space, effectively removing paralinguistic information from semantic encoding. An acoustic-constrained multi-stage optimization strategy is proposed to ensure robust and stable convergence. Figure 1 shows that SecoustiCodec achieves SOTA (state-of-the-art) reconstruction quality (PESQ) of 1.77/2.58 at 0.27/1 kbps. We have open-sourced SecoustiCodec's demo, code, and model weights.
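To make the FSQ step mentioned in the abstract more concrete, the following is a minimal sketch of generic finite scalar quantization, not the authors' implementation. It bounds each latent channel, rounds it to a small fixed set of integer levels, and maps the resulting code vector to a single codebook index. The level counts `[5, 5, 5]`, the `tanh` bounding, and the function names `fsq_quantize` and `codes_to_index` are assumptions chosen for illustration; the straight-through gradient used during training is omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each channel of z to one of `levels[i]` integer values.

    Assumption: every entry of `levels` is odd, so the grid is symmetric
    around zero (e.g. 5 levels -> {-2, -1, 0, 1, 2}).
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half          # squash each channel into (-half, half)
    return np.round(bounded).astype(np.int64)

def codes_to_index(codes, levels):
    """Map a per-channel code vector to a single token id (mixed-radix)."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = codes + half                # shift to 0 .. levels[i]-1
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * bases).sum(axis=-1)

# Hypothetical configuration: 3 channels with 5 levels each -> 125 tokens.
levels = [5, 5, 5]
frames = np.random.default_rng(0).normal(size=(4, 3))  # 4 frames, 3 channels
token_ids = codes_to_index(fsq_quantize(frames, levels), levels)
```

Because the codebook is an implicit grid rather than a learned table, every index is reachable by construction, which is one common explanation for the high codebook utilization that FSQ-style quantizers tend to show.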

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of different speech codecs operating below 2 kbps. The y-axis represents reconstruction quality (PESQ), while the x-axis indicates bitrate (kbps). Circle sizes correspond to the number of discrete tokens encoded per second. SecoustiCodec supports streaming and achieves SOTA performance at low bitrates. Although BigCodec achieves comparable results, it neither supports causal streaming nor matches the parameter efficiency of SecoustiCodec.
  • Figure 2: SecoustiCodec employs trained acoustic representations (frame-level continuous values) to constrain the joint training of semantic representations (frame-level discrete values) and paralinguistic representations (global-level continuous values). While we acknowledge that certain paralinguistic cues (e.g., nuanced emotional shifts) exhibit fine-grained variations not fully captured at a global level, we deliberately employ a global-level representation for paralinguistics. This design enables robust semantic decoupling while capturing dominant residual information between acoustic and semantic representations, such as speaker timbre and broad emotional characteristics. We posit that this global-level representation efficiently captures the majority of paralinguistic information while facilitating the core relationship: $\text{Semantic} + \text{Paralinguistic} \approx \text{Acoustic}$.
  • Figure 3: SecoustiCodec includes three modeling processes: (a) Acoustic Modeling, (b) Semantic Modeling, and (c) Paralinguistic Modeling. Modules outlined in red operate in a streaming manner, while those in blue are non-streaming. Phoneme embeddings $(P)$ are extracted from text, and target semantic embeddings $(S)$, acoustic embeddings $(A)$, and paralinguistic embeddings $(G)$ are extracted from speech. $(P)$ and $(S)$ are used to construct the token-acoustic contrastive loss, which learns frame-level (dis)similarity between a batch of speech and text pairs. During inference, Acoustic Projection is not required; instead, the semantic embedding and paralinguistic embedding are used to predict the acoustic embedding. The mean values ($\mu$ and $\hat{\mu}$) from the VAE structure are used directly as inputs during inference, bypassing stochastic sampling. The name "SecoustiCodec" signifies a codec that supports both semantic and acoustic encoding.
  • Figure 4: Codebook Utilization.
  • Figure 5: Spectrograms, F0, and energy of synthesized speech (the same semantic codes combined with different paralinguistic embeddings). The bottom row is the ground truth. The synthesized speech and the paralinguistic reference speech are consistent in the numerical range and variation trends of the spectrogram, F0, and energy.
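The Figure 3 caption describes a contrastive loss built from phoneme embeddings $(P)$ and semantic embeddings $(S)$ that learns frame-level (dis)similarity across a batch. The sketch below illustrates that idea with a standard symmetric InfoNCE objective, which is a stand-in and not necessarily the paper's exact formulation. It assumes that `P` and `S` have already been aligned frame by frame and flattened to shape `(N, D)`, so that row `i` of each matrix describes the same frame; the function name and the temperature value are hypothetical.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def frame_contrastive_loss(P, S, temperature=0.1):
    """Symmetric InfoNCE over N frame pairs.

    P, S: arrays of shape (N, D). Row i of P and row i of S form the
    positive pair; all other rows in the batch act as negatives.
    The temperature is an illustrative default, not a value from the paper.
    """
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    logits = P @ S.T / temperature       # (N, N) cosine similarities
    idx = np.arange(len(P))
    text_to_speech = -log_softmax(logits, axis=1)[idx, idx].mean()
    speech_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (text_to_speech + speech_to_text)

# Toy check: matched frames should score a much lower loss than shuffled ones.
frames = np.eye(4)
aligned_loss = frame_contrastive_loss(frames, frames)
shuffled_loss = frame_contrastive_loss(frames, np.roll(frames, 1, axis=0))
```

Since the text side carries no timbre or emotion, pulling each speech frame toward its phoneme embedding and away from other frames is one plausible way such an objective discourages paralinguistic information from entering the semantic codes.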