STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition

Siyu Wang, Haitao Li, Donglai Zhu

TL;DR

This paper introduces STCTS, a three-stream semantic compression framework that transmits text, sparse prosody, and a speaker timbre embedding to enable natural voice communication at ~80 bps. By representing content, prosody, and timbre explicitly, and by combining modular STT/TTS components with compression tailored to each stream, STCTS achieves large bitrate reductions while preserving intelligibility, speaker identity, and perceptual quality. Key findings include a bimodal relationship between prosody update rate and quality that favors either sparse or dense updates, robust performance under noise with prioritized transmission, and real-time feasibility on consumer hardware. Compared with end-to-end neural codecs, the approach is more interpretable, privacy-friendly, and upgradeable, with practical potential for maritime, satellite, tactical, and other bandwidth-constrained scenarios.

Abstract

Voice communication in bandwidth-constrained environments--maritime, satellite, and tactical networks--remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework enabling natural voice communication at 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, applying tailored compression to each: context-aware text encoding (70 bps), sparse prosody transmission via TTS interpolation (<14 bps at 0.1-1 Hz), and an amortized speaker embedding. Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS > 4.26), degrading gracefully under packet loss, and remaining resilient to noise. We also discover a bimodal relationship between quality and prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade due to perceptual discontinuities, a finding that guides configuration design. Beyond efficiency, our modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low bandwidth scenarios.
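The reduction factors quoted in the abstract follow directly from the stated bitrates, and the same arithmetic shows how a one-time timbre payload is amortized over a session. The sketch below is illustrative only: the function names are hypothetical, and the embedding size and session length are assumed values that the source does not report.

```python
def reduction_factor(baseline_bps: float, target_bps: float) -> float:
    """How many times smaller the target bitrate is than the baseline."""
    if target_bps <= 0:
        raise ValueError("target_bps must be positive")
    return baseline_bps / target_bps


def amortized_bps(payload_bits: float, session_seconds: float) -> float:
    """Average bitrate of a payload sent once and reused for a whole session."""
    if session_seconds <= 0:
        raise ValueError("session_seconds must be positive")
    return payload_bits / session_seconds


# Bitrates taken from the abstract.
STCTS_BPS = 80.0
OPUS_BPS = 6000.0
ENCODEC_BPS = 1000.0

print("vs Opus:   ", reduction_factor(OPUS_BPS, STCTS_BPS))
print("vs EnCodec:", reduction_factor(ENCODEC_BPS, STCTS_BPS))

# ASSUMPTION: a 256-dimensional embedding at 16 bits per value and a
# 10-minute session. Neither figure comes from the paper.
HYPOTHETICAL_EMBEDDING_BITS = 256 * 16
HYPOTHETICAL_SESSION_SECONDS = 600.0
print("timbre, amortized:",
      amortized_bps(HYPOTHETICAL_EMBEDDING_BITS, HYPOTHETICAL_SESSION_SECONDS))
```

The EnCodec ratio computes to 12.5, which the abstract reports as 12x. The amortized timbre cost shrinks as the session lengthens and drops to zero for a speaker whose profile is already cached at the receiver.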

Paper Structure

This paper contains 44 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Operational scenario. Edge devices (e.g., on maritime vessels) transmit decomposed speech components (text, prosody, timbre) over constrained satellite links. The receiver reconstructs natural speech from these semantic streams, overcoming high latency and packet loss.
  • Figure 2: STCTS system architecture. The sender decomposes speech into three orthogonal components, namely text (continuous, ~70 bps), prosody (sparse keyframes, 0.7-14 bps), and timbre (one-time transmission, amortized), each compressed with a tailored strategy. These are transmitted via WebRTC data channels with prioritized delivery. The receiver reconstructs natural speech via TTS conditioning on all three components, with timbre profiles cached locally for recurring speakers.
  • Figure 3: Sparse prosody interpolation principle. The system transmits prosody keyframes at a very low rate (e.g., 0.5 Hz, red dots). The receiver reconstructs the continuous pitch contour (blue line) via cubic spline interpolation, which closely approximates the original macro-prosody (gray dashed line) while discarding micro-jitter, achieving ultra-low prosody bitrate (<14 bps).
  • Figure 4: Prioritized transmission and error handling mechanism. Text packets (High Priority) are retransmitted upon loss to ensure semantic integrity. Prosody Keyframes (Medium Priority) are retransmitted to maintain interpolation anchors, while Prosody Deltas (Low Priority) are discarded if lost, allowing graceful degradation. Timbre packets are sent once per speaker and cached, amortizing bandwidth cost.
  • Figure 5: Prosody sampling rate analysis across four key dimensions. The results reveal a bimodal quality distribution: high quality is achieved at both sparse rates (0.1-1 Hz) and dense rates (>6 Hz), while mid-range rates suffer from interpolation artifacts.
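
The receiver-side reconstruction described in the Figure 3 caption, which rebuilds a continuous pitch contour from sparse keyframes, can be sketched with a cubic spline in pure Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the caption does not specify boundary conditions, so a natural spline (zero second derivative at both ends) is assumed, and the function name and the keyframe pitch values are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return f(t) that interpolates the points (xs, ys) with a natural cubic spline.

    xs must be strictly increasing. Outside [xs[0], xs[-1]] the first or
    last cubic segment is extrapolated.
    """
    n = len(xs) - 1  # number of segments
    if n < 1 or len(ys) != len(xs):
        raise ValueError("need at least two points and matching lengths")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    if any(step <= 0 for step in h):
        raise ValueError("xs must be strictly increasing")

    # m[i] is the second derivative at knot i; natural ends keep m[0] = m[n] = 0.
    m = [0.0] * (n + 1)
    if n >= 2:
        # Tridiagonal system for m[1..n-1]; row k corresponds to knot k + 1.
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        # Forward elimination (Thomas algorithm); the off-diagonal entry
        # between rows k - 1 and k is h[k].
        for k in range(1, n - 1):
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        # Back substitution.
        for k in range(n - 2, -1, -1):
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def evaluate(t):
        # Locate the segment containing t, clamped to the valid range.
        i = min(max(bisect_right(xs, t) - 1, 0), n - 1)
        a = xs[i + 1] - t
        b = t - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


# Keyframes at 0.5 Hz (one every 2 s); the pitch values in Hz are made up.
keyframe_times = [0.0, 2.0, 4.0, 6.0]
keyframe_pitch_hz = [110.0, 130.0, 120.0, 125.0]
contour = natural_cubic_spline(keyframe_times, keyframe_pitch_hz)

# Resample the contour every 0.5 s for a downstream synthesizer.
for step in range(13):
    t = 0.5 * step
    print(f"t={t:.1f}s  f0={contour(t):.2f} Hz")
```

Because only the keyframes cross the link, the prosody bitrate scales with the keyframe rate rather than with the frame rate of the reconstructed contour. The spline passes exactly through each keyframe and varies smoothly in between, which matches the caption's description of keeping macro-prosody while discarding micro-jitter.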