SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

TL;DR

SHANKS introduces a thinking-while-listening framework for spoken language models by streaming user input in fixed chunks and generating unspoken reasoning in parallel with ongoing speech. Through two scenarios—interrupting erroneous step-by-step math solutions and making API calls while listening—SHANKS demonstrates reduced latency and proactive interactions, outperforming baselines that interrupt or call tools only after speech ends. The work provides end-to-end and cascade variants, plus scenario-specific baselines, and evaluates interruption quality, latency, and API-call performance using human-like judges and complex benchmarks. While promising for real-time, tool-augmented dialogue, SHANKS requires structured long-form speech, incurs extra compute, and invites further optimization of chunking and backchannel strategies.

Abstract

Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of Shanks can be found at https://d223302.github.io/SHANKS/

Paper Structure

This paper contains 31 sections, 5 figures, 10 tables.

Figures (5)

  • Figure 1: The timing diagram of Shanks. As the user speaks, their speech is segmented into chunks of $t_{\rm chunk}$ seconds each and streamed to the SLM. After receiving an input chunk, Shanks generates thinking tokens, which may include calls to external tools or a decision to interrupt the user. While the user is speaking the $i$-th speech chunk $S_i$, Shanks generates the $(i-1)$-th thinking chunk $R_{i-1}$, thereby thinking while listening. Once the user has fully spoken the current speech chunk $S_{i}$, Shanks stops generating $R_{i-1}$, adds the latest speech $S_{i}$ and the previous reasoning $R_{i-1}$ to its context, and begins the $i$-th thinking chunk $R_{i}$.
  • Figure 2: Illustration of the training data. $S_i$: the speech for the $i$-th user speech chunk; $R_i$: the $i$-th thinking block after $S_i$; $O$: the final response block; $A_i$: the API call responses after the speech chunk $S_i$. Blocks in dashed lines do not contribute to the training loss, while blocks in solid lines are included in the loss calculation. (a) The general training sequence: alternating user speech blocks and SLM thinking chunks (see the Training subsubsection), followed by a final response chunk. (b) Training data with interruption: alternating user speech blocks and thinking chunks, where the last thinking chunk includes a special token [INTERRUPT]. (c) Training data with API calls: similar to (a), except that each thinking chunk may be split into two blocks $R_{i-1}$ and $R_{i-2}$ by the API call response $A_i$.
  • Figure 3: An example from the interruption scenario (the extracted cross-reference points to the section "Application 2: API Call When Listening to Reduce Response Latency"). The chunks in red are the transcriptions of a user describing a math problem and attempting to solve it step by step. The thinking chunks (in green) and the interruption response (in orange) are generated by Shanks-E2E. Within each time slot from $nt_{\rm chunk}$ to $(n+1)t_{\rm chunk}$, the chunks in green (SLM thinking chunks) and orange (output response) happen sequentially, while the user speech chunk (in red) happens concurrently with the other blocks in the same time slot.
  • Figure 4: An example user query from ComplexFuncBench (in red), together with the unspoken thinking process (in green) and the spoken final response (in orange) from Shanks-E2E. Within each time slot from $nt_{\rm chunk}$ to $(n+1)t_{\rm chunk}$, the chunks in green (SLM thinking chunks), blue (API call responses), and orange (output response) happen sequentially, while the user speech chunk (in red) happens concurrently with the other blocks in the same time slot. $t=T$ denotes the time at which the user's speech ends.
  • Figure 5: The interruption latency for Shanks. The bars in red are the results on the wrong subset, while the bars in green are the results on the correct subset. One can observe that the red bars are mostly positive, meaning that the model tends to interrupt after the first error occurs.
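
The chunk-by-chunk schedule described in the Figure 1 caption can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `think` callable is a hypothetical stand-in for the SLM, `max_tokens` is an assumed proxy for how much reasoning fits into one $t_{\rm chunk}$ window, and `toy_think` is an invented rule used only to make the example runnable. Only the [INTERRUPT] token and the alternation of $S_i$ and $R_i$ in the context come from the captions above.

```python
from typing import Callable, List, Optional, Tuple

INTERRUPT = "[INTERRUPT]"  # special token named in the Figure 2 caption

Context = List[Tuple[str, str]]  # alternating ("speech", S_i) and ("thinking", R_i)


def run(chunks: List[str],
        think: Callable[[Context, int], str],
        max_tokens: int = 8) -> Tuple[Optional[int], Context]:
    """Feed speech chunks one at a time and think after each one.

    Returns the 1-based index of the chunk after which the model chose to
    interrupt (None if it never did) and the accumulated context.
    """
    context: Context = []
    for i, speech in enumerate(chunks, start=1):
        # S_i has been fully spoken: add it to the context.
        context.append(("speech", speech))
        # Generate R_i from all previous speech and reasoning. In a real
        # system this overlaps with the user speaking S_{i+1}; the token
        # budget (an assumption) models the cut-off when S_{i+1} ends.
        reasoning = think(context, max_tokens)
        context.append(("thinking", reasoning))
        if INTERRUPT in reasoning:
            return i, context
    return None, context


def toy_think(context: Context, max_tokens: int) -> str:
    """Hypothetical reasoning rule: flag one hard-coded arithmetic slip."""
    latest_speech = context[-1][1]
    if "2 + 2 = 5" in latest_speech:
        thought = f"{INTERRUPT} the user made an arithmetic error"
    else:
        thought = "no error so far"
    # Truncate to the budget, mimicking thinking that is stopped early.
    return " ".join(thought.split()[:max_tokens])


if __name__ == "__main__":
    spoken = ["Let x be two.", "Then 2 + 2 = 5,", "so the answer is five."]
    interrupted_at, ctx = run(spoken, toy_think)
    print(interrupted_at, ctx)
```

In this toy run the loop stops after the second chunk, so the third chunk is never consumed. A real deployment would run the thinking step concurrently with audio capture rather than sequentially as shown here.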