CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

Xiaosu Su; Zihan Sun; Peilei Jia; Jun Gao

Abstract

Voice design from natural language descriptions is an emerging task in text-to-speech multimodal generation that aims to synthesize speech with a target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, to the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.
