Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
Donghang Wu, Haoyang Zhang, Jun Chen, Xiangyu Zhang, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
TL;DR
This work tackles the latency of reasoning in real-time Spoken Language Models (SLMs) by proposing Mind-Paced Speaking (MPS), a dual-brain framework in which a Formulation Brain continuously generates thinking content and an Articulation Brain converts the partial thinking into fluent speech. A think-incomplete supervised fine-tuning method teaches the Articulation Brain to respond from partial Chain-of-Thought (CoT) content, and two operating modes, Think-First and Speak-First, offer a flexible trade-off between latency and performance. On Spoken-MQA and URO-Bench, MPS outperforms direct-response baselines and existing think-while-speaking methods, achieving high accuracy while sharply reducing latency and, in the Speak-First mode, eliminating it. The approach bridges high-quality reasoning and real-time interaction, offering a neuroscience-inspired paradigm for coherent, real-time dialogue in SLMs.
Abstract
Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a "Formulation Brain" for high-level reasoning to pace and guide a separate "Articulation Brain" for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. Our work effectively bridges the gap between high-quality reasoning and real-time interaction.
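The division of labor described above can be pictured as a simple interleaving loop. The following is a minimal toy sketch, not the authors' implementation: the names `run_dual_brain`, `toy_think`, and `toy_speak` are hypothetical, the two "brains" are plain Python stubs rather than language models, and the alternation of one thinking chunk per speech segment is an assumption made for illustration. It shows only how a speaker conditioned on partial thinking can start talking after one thinking chunk (a Think-First style) or before any thinking exists (a Speak-First style).

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: each "brain" sees the history and returns the next
# chunk, or None when it has nothing (more) to produce.
ThinkFn = Callable[[List[str]], Optional[str]]
SpeakFn = Callable[[List[str], List[str]], Optional[str]]


def run_dual_brain(
    think: ThinkFn,
    speak: SpeakFn,
    mode: str = "think_first",
    max_rounds: int = 32,
) -> Tuple[List[str], List[str], int]:
    """Interleave a thinking stub and a speaking stub.

    Returns (thoughts, speech, n), where n is the number of thinking chunks
    that existed when the speaker was first invoked (a crude latency proxy).
    """
    if mode not in ("think_first", "speak_first"):
        raise ValueError(f"unknown mode: {mode}")

    thoughts: List[str] = []
    speech: List[str] = []
    thinking_done = False
    thoughts_before_first_speech: Optional[int] = None

    for round_idx in range(max_rounds):
        # In speak_first mode the very first round skips thinking entirely.
        if not thinking_done and not (mode == "speak_first" and round_idx == 0):
            chunk = think(thoughts)
            if chunk is None:
                thinking_done = True
            else:
                thoughts.append(chunk)

        if thoughts_before_first_speech is None:
            thoughts_before_first_speech = len(thoughts)

        # The speaker is conditioned on the partial thinking produced so far.
        segment = speak(thoughts, speech)
        if segment is not None:
            speech.append(segment)
        elif thinking_done:
            break  # nothing left to think and nothing left to say

    return thoughts, speech, thoughts_before_first_speech or 0


def toy_think(thoughts: List[str]) -> Optional[str]:
    """Toy thinker: emits a fixed three-step plan, one chunk per call."""
    plan = ["3 apples plus 4 apples", "3 + 4 = 7", "the answer is 7"]
    return plan[len(thoughts)] if len(thoughts) < len(plan) else None


def toy_speak(thoughts: List[str], speech: List[str]) -> Optional[str]:
    """Toy speaker: voices the next unvoiced thought, or a filler if none exist."""
    voiced = sum(1 for s in speech if s.startswith("say:"))
    if voiced < len(thoughts):
        return "say:" + thoughts[voiced]
    if not thoughts and not speech:
        return "Let me see."  # speak-first opening with no thinking available
    return None
```

With these stubs, `run_dual_brain(toy_think, toy_speak, "think_first")` invokes the speaker only after one thinking chunk exists, whereas `"speak_first"` invokes it with no thinking at all and opens with a filler segment. In a real system the stubs would be replaced by model calls, and the segment sizes and scheduling would follow whatever the actual method specifies.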
