The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

Donghang Wu, Tianyu Zhang, Yuxin Li, Hexin Liu, Chen Chen, Eng Siong Chng, Yoshua Bengio

Abstract

During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

Paper Structure

This paper contains 25 sections, 16 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: (a) Turn-based interaction. The agent stays idle until the user's turn ends and only then starts to respond; it cannot be interrupted by the user, as shown in the red box. (b) Full-duplex interaction. A full-duplex SDLM continuously listens to streaming speech input, supports user barge-in, and automatically switches between thinking and speaking, like a human speaker.
  • Figure 2: Overview of the proposed FLAIR. During the user's speech phase, the LLM performs latent reasoning, feeding its output latent embedding back as the input for the next step. Once the user finishes speaking, the assistant autonomously decides when to respond; the LLM then runs an explicit forward pass, using text tokens as the input for the next step. When the user barges in, the LLM autonomously decides when to stop speaking and reverts to latent reasoning.
  • Figure 3: Distributions of the input audio, target text, and latent reasoning embeddings. The latent reasoning embeddings act as a bridge connecting the input audio with the target text.
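
The mode switching described in the abstract and in the Figure 2 caption can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' implementation: `ToyBackbone`, `run_dialogue`, the random weights, and the additive mixing of audio frame and feedback are all hypothetical. In particular, the caption says the model decides autonomously when to speak and when to stop, whereas this sketch takes a hand-written `user_speaking` flag as input.

```python
import numpy as np

HIDDEN = 8  # toy hidden size (assumption)
VOCAB = 5   # toy vocabulary size (assumption)


class ToyBackbone:
    """Hypothetical stand-in for the LLM: one recurrent step per audio frame."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(scale=0.3, size=(HIDDEN, HIDDEN))
        self.w_state = rng.normal(scale=0.3, size=(HIDDEN, HIDDEN))
        self.token_emb = rng.normal(size=(VOCAB, HIDDEN))  # text token embeddings
        self.w_out = rng.normal(size=(HIDDEN, VOCAB))      # output projection

    def step(self, x, state):
        # Causal update: uses only the current input and the past state.
        return np.tanh(x @ self.w_in + state @ self.w_state)


def run_dialogue(model, audio_frames, user_speaking):
    """Consume one audio frame per step and pick the feedback for the next step."""
    state = np.zeros(HIDDEN)
    feedback = np.zeros(HIDDEN)
    trace = []
    for frame, speaking in zip(audio_frames, user_speaking):
        # Assumption: audio frame and feedback are simply summed.
        hidden = model.step(frame + feedback, state)
        state = hidden
        if speaking:
            # Latent reasoning: the output embedding itself is the next input.
            feedback = hidden
            trace.append(("latent", None))
        else:
            # Explicit pass: decode a text token and feed its embedding back.
            token = int(np.argmax(hidden @ model.w_out))
            feedback = model.token_emb[token]
            trace.append(("speak", token))
    return trace


# Usage: the user talks for three frames, pauses for two, then barges in.
model = ToyBackbone()
frames = np.zeros((6, HIDDEN))
flags = [True, True, True, False, False, True]
for mode, token in run_dialogue(model, frames, flags):
    print(mode, token)
```

Because each step performs exactly one forward pass per incoming frame in either mode, the latent branch adds no steps beyond those already needed to consume the audio, which is the sense in which the abstract describes thinking while listening as introducing no additional latency.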