PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models

Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, Bryan Catanzaro

TL;DR

PersonaPlex addresses the rigidity of fixed voice and role in duplex speech models by introducing a hybrid system prompt that combines textual role conditioning with audio voice prompts. It integrates a Moshi-based full-duplex architecture with zero-shot voice cloning, trained on a large synthetic corpus generated from open-source LLMs and TTS, and evaluated on Full-Duplex-Bench and an extended multi-role Service-Duplex-Bench. Results show improved role adherence, speaker similarity, and dialog naturalness compared to state-of-the-art duplex and hybrid LLM-based speech systems, approaching the performance of closed commercial systems. The work enables scalable, personalized, role-conditioned conversational agents and provides released checkpoints with expanded data and refined prompting to boost naturalness and backchanneling.

Abstract

Recent advances in duplex speech models have enabled natural, low-latency speech-to-speech interactions. However, existing models are restricted to a fixed role and voice, limiting their ability to support structured, role-driven real-world applications and personalized interactions. In this work, we introduce PersonaPlex, a duplex conversational speech model that incorporates hybrid system prompts, combining role conditioning with text prompts and voice cloning with speech samples. PersonaPlex is trained on a large-scale synthetic dataset of paired prompts and user-agent conversations, generated with open-source large language models (LLMs) and text-to-speech (TTS) models. To evaluate role conditioning in real-world settings, we extend the Full-Duplex-Bench benchmark beyond a single assistant role to multi-role customer service scenarios. Experiments show that PersonaPlex achieves strong role-conditioned behavior, voice-conditioned speech, and natural conversational responsiveness, surpassing state-of-the-art duplex speech models and hybrid LLM-based speech systems in role adherence, speaker similarity, latency, and naturalness.

Paper Structure (23 sections, 1 figure, 7 tables)

Figures (1)

  • Figure 1: PersonaPlex's neural network is a duplex speech model based on Moshi, with a Hybrid System Prompt enabling textual prompts and voice cloning. The model then autoregressively generates text and audio while receiving live user audio.