STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

TL;DR

The paper addresses the latency cost of adding reasoning to spoken language models (SLMs) by introducing Stitch, a generation method that interleaves unspoken reasoning chunks with spoken response chunks so that the model thinks while it talks. Two variants are studied: Stitch-R (reasoning-first) and Stitch-S (speaking-first). Both generate reasoning tokens in the time left over while a fixed-length audio chunk plays, so the reasoning is never voiced. On math reasoning datasets, Stitch outperforms baselines that cannot generate unspoken chain-of-thought (CoT) by 15%, while performing comparably on non-reasoning datasets; Stitch-S adds no latency relative to those baselines. The results suggest that unspoken internal reasoning can substantially improve answer quality in SLMs without sacrificing responsiveness, which matters for real-time dialogue systems.

Abstract

Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
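The "remaining free time" argument in the abstract can be made concrete with a small budget calculation. The sketch below is illustrative only and is not the authors' implementation: the function name `reasoning_token_budget`, the assumption of a constant generation speed, and the numbers in the usage note are all hypothetical.

```python
import math


def reasoning_token_budget(
    t_chunk: float,
    tokens_per_second: float,
    n_text: int,
    n_speech: int,
) -> int:
    """Estimate how many reasoning tokens fit in one audio chunk's playback time.

    Assumption (not from the paper): the model generates tokens at a constant
    rate, so the number of tokens it can emit while one chunk plays is
    floor(t_chunk * tokens_per_second). The text and speech tokens of the next
    chunk must come out of that total; whatever is left can be spent on
    unspoken reasoning.
    """
    total_tokens = math.floor(t_chunk * tokens_per_second)
    return max(0, total_tokens - n_text - n_speech)
```

For example, with a hypothetical 2-second audio chunk, a generation speed of 80 tokens per second, and 13 text plus 26 speech tokens per chunk, the budget is 160 - 39 = 121 reasoning tokens. If the chunk is too short to cover even the text and speech tokens, the budget is clamped to zero.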

Paper Structure

This paper contains 33 sections, 5 figures, 13 tables.

Figures (5)

  • Figure 1: The timing diagram during generation for Stitch-R. The model first generates the first $N_{\text{reason}}$ CoT reasoning tokens, $N_{\text{text}}$ text tokens, and $N_{\text{speech}}$ speech tokens. Once the first $N_{\text{speech}}$ speech tokens are generated, the speech decoder synthesizes the output audio that lasts $t_{\text{chunk}}$ seconds. When the speech waveform is synthesized and played to the user, the SLM uses this time to generate the next $N_{\text{reason}}$ reasoning tokens, $N_{\text{text}}$ text tokens, and $N_{\text{speech}}$ speech tokens and synthesize the speech output. The duration $t_{\text{chunk}}$ is much longer than the time for generating the text tokens and speech tokens corresponding to $S_{i}$, and we use the remaining time to generate the reasoning tokens.
  • Figure 2: The different generation methods explored in this paper. The arrow represents the timeline along which the SLM generates tokens; it should not be confused with the timeline on which the end user receives the audio, i.e., the upper timeline in Figure 1. Tokens of the same type within a chunk are plotted in the same color. (a) GLM-4-Voice: interleaving text and speech token chunks (Section \ref{section: Related Work: Spoken Language Models}). This is the design of the original interleaved SLMs. (b) TBS: generating a complete reasoning span and then interleaving text and speech token chunks (Section \ref{subsection: T2S2: Thinking in Text before Speaking in Speech}). (c) Stitch-R: alternating between reasoning token chunks, text token chunks, and speech token chunks (Section \ref{subsection: T2S2-Interleave: Interleaving Text Reasoning CoT and Speech Output}). (d) Stitch-S: alternating between text token chunks, speech token chunks, and reasoning token chunks (Section \ref{subsection: T2S2-Reverse-Interleave: Speech Span First before Partial Reasoning Span}).
  • Figure 3: Figures \ref{fig:Images/T2S2-I.pdf} and \ref{fig:Images/T2S2-RI.pdf} show the accuracy when varying $N'_{token}$ for Stitch-R and Stitch-S, respectively; the dots mark the performance of the "no reasoning" baseline (Section \ref{subsection: Adjust Length}). Figure \ref{fig:Images/augment_reasoning.pdf} shows the performance when a reasoning augmentation model generates the text reasoning for Stitch-R (Section \ref{subsection: Using text reasoning from Other Models}); the accuracy is averaged over five math reasoning datasets.
  • Figure 4: The interface of the human evaluation.
  • Figure 5: The two panels show the same Stitch-R model with different $t_{\rm syn}$: the upper panel has a smaller $t_{\rm syn}$ and the lower panel a larger one. The figure illustrates that the speech decoder and the token generation of Stitch can run in parallel. Consequently, as long as $t_{\rm syn}<t_{\rm chunk}$, consecutive audio chunks can be played to the user seamlessly. The difference in $t_{\rm syn}$ only affects the first-packet latency.
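
The chunk orderings described in the Figure 2 caption and the playback condition from the Figure 5 caption can be written out as a short sketch. This is a simplified illustration, not the paper's code: the function names, the method keys, and the chunk labels are hypothetical, and the sketch assumes every cycle contains all chunk types (it ignores the case where the reasoning finishes before the spoken response does).

```python
from typing import List


def generation_order(method: str, num_chunks: int) -> List[str]:
    """Return the order in which chunk types are generated.

    Method keys (hypothetical labels for the four designs in Figure 2):
      "interleaved": text and speech chunks alternate, no reasoning
      "tbs":         one complete reasoning span, then text/speech alternate
      "stitch_r":    reasoning chunk, then text chunk, then speech chunk
      "stitch_s":    text chunk, then speech chunk, then reasoning chunk
    """
    per_cycle = {
        "interleaved": ["text", "speech"],
        "stitch_r": ["reason", "text", "speech"],
        "stitch_s": ["text", "speech", "reason"],
    }
    if method == "tbs":
        # The whole reasoning span is generated once, before any speech.
        return ["full_reason"] + ["text", "speech"] * num_chunks
    if method not in per_cycle:
        raise ValueError(f"unknown method: {method!r}")
    return per_cycle[method] * num_chunks


def plays_seamlessly(t_syn: float, t_chunk: float) -> bool:
    """Condition stated in the Figure 5 caption: t_syn < t_chunk."""
    return t_syn < t_chunk
```

Comparing `generation_order("stitch_r", 2)` with `generation_order("stitch_s", 2)` shows why only the speaking-first variant can match the baseline's latency: its first generated chunk is a text chunk, exactly as in the interleaved baseline, whereas Stitch-R must finish one reasoning chunk before the first text and speech tokens appear.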