Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

Donghang Wu, Haoyang Zhang, Jun Chen, Xiangyu Zhang, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

TL;DR

This work tackles the latency of reasoning in real-time Spoken Language Models by proposing Mind-Paced Speaking (MPS), a dual-brain framework with a Formulation Brain that continuously generates thinking content and an Articulation Brain that converts partial thinking into fluent speech. A think-incomplete supervised fine-tuning method enables the Articulation Brain to respond from partial CoT content, and two operation modes—Think-First and Speak-First—offer flexible latency-performance trade-offs. Empirical results on Spoken-MQA and URO-Bench show MPS outperforms direct-response baselines and existing think-while-speaking methods, achieving high accuracy while dramatically reducing or even eliminating latency in the Speak-First variant. The approach bridges high-quality reasoning and real-time interaction, delivering a neuroscience-inspired paradigm for coherent, real-time dialogue in SLMs.

Abstract

Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a "Formulation Brain" for high-level reasoning to pace and guide a separate "Articulation Brain" for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. Our work effectively bridges the gap between high-quality reasoning and real-time interaction.

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The TBS architecture. For conciseness, we omit the input text, which is optional in SLMs. The TBS SLM first generates the full CoT and then produces response tokens.
  • Figure 2: Architecture of the proposed MPS. For conciseness, we omit the input text, which is optional in SLMs. We show the process from step i to step i+1 when generating think segments and response segments. The Formulation Brain LLM continuously generates think segments. The newly generated think segment and the response segment from the previous step are both added as a prefix to the Articulation Brain LLM, pacing it to produce the corresponding response segment.
  • Figure 3: An example of the output of MPS-spkfirst on the Spoken-MQA dataset. The Articulation Brain first generates a response segment. Simultaneously, the Formulation Brain continuously generates new think segments, and each newly generated think segment is prefixed to the Articulation Brain, pacing it to generate a new response segment.
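
The segment-level pacing described in the captions for Figures 2 and 3 can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the two brains are replaced by hypothetical stub callables (`toy_formulate`, `toy_articulate`), the function name `mind_paced_speaking` and the `num_steps` and `speak_first` parameters are invented for this example, and the two brains run sequentially here, whereas the captions describe them running simultaneously.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces standing in for the two LLMs (assumptions, not the paper's API).
# formulate(question, think_so_far) -> next think segment
FormulateFn = Callable[[str, List[str]], str]
# articulate(question, think_so_far, responses_so_far) -> next response segment
ArticulateFn = Callable[[str, List[str], List[str]], str]


def mind_paced_speaking(
    question: str,
    formulate: FormulateFn,
    articulate: ArticulateFn,
    num_steps: int,
    speak_first: bool = False,
) -> Tuple[List[str], List[str]]:
    """Alternate think and response segments, one pair per step."""
    think_segments: List[str] = []
    response_segments: List[str] = []

    if speak_first:
        # Speak-First: emit a response segment before any think segment exists.
        response_segments.append(articulate(question, [], []))

    for _ in range(num_steps):
        # The Formulation Brain extends the chain of thought by one segment.
        think_segments.append(formulate(question, think_segments))
        # The Articulation Brain is prefixed with the partial thinking and the
        # responses produced so far, then emits the next response segment.
        response_segments.append(
            articulate(question, list(think_segments), list(response_segments))
        )
    return think_segments, response_segments


def toy_formulate(question: str, think_so_far: List[str]) -> str:
    # Stub: labels each think segment by its position.
    return f"think{len(think_so_far) + 1}"


def toy_articulate(
    question: str, think_so_far: List[str], responses_so_far: List[str]
) -> str:
    # Stub: records how many think segments were visible when speaking.
    return f"resp{len(responses_so_far) + 1}|seen={len(think_so_far)}"


if __name__ == "__main__":
    print(mind_paced_speaking("q", toy_formulate, toy_articulate, 2))
    print(mind_paced_speaking("q", toy_formulate, toy_articulate, 2, speak_first=True))
```

In the toy run, the `seen=` counter makes the difference between the modes visible: with `speak_first=True` the first response segment is produced with zero think segments available, which corresponds to the zero-latency configuration mentioned in the abstract, while the default mode waits for one think segment before speaking.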