Can Speech LLMs Think while Listening?

Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer

TL;DR

This work demonstrates that fine-tuning a multi-stream speech LLM with text-based chain-of-thought substantially improves spoken reasoning performance, achieving an average 2.4x accuracy gain on SRQA tasks. It introduces a thinking-while-listening framework guided by a Question Completeness metric to start reasoning earlier, and uses Direct Preference Optimization to balance accuracy and latency, yielding up to a 70% reduction in latency under tuned settings. Built on the Moshi multi-stream architecture, the approach interleaves CoT with streaming ASR within the text monologue stream, enabling concurrent listening, reasoning, and speaking. The results establish a strong accuracy improvement and a controllable accuracy-latency trade-off, signaling a practical path toward more responsive and cognitively enabled Speech LLMs.

Abstract

Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency Pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
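
To make the idea of an entropy-based indicator concrete, the sketch below computes a completeness-style score from answer distributions conditioned on growing prefixes of a question. The abstract does not give the formula for question completeness, so the normalization used here (one minus entropy divided by its maximum), the function names, the toy distributions, and the threshold are all assumptions for illustration, not the paper's definition.

```python
import math
from typing import Optional, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def completeness_score(probs: Sequence[float]) -> float:
    """Hypothetical score in [0, 1]: 1 - H(p) / log(K).

    Assumption: a confident (low-entropy) answer distribution means the
    question prefix already carries most of the information needed.
    """
    k = len(probs)
    if k < 2:
        return 1.0
    return 1.0 - entropy(probs) / math.log(k)


def first_start_index(curve: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first prefix whose score reaches the threshold, if any."""
    for i, score in enumerate(curve):
        if score >= threshold:
            return i
    return None


# Toy answer distributions over 4 options after hearing 1, 2, and 3 words.
# These numbers are made up; a real system would obtain them from a model.
prefix_distributions = [
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.97, 0.01, 0.01, 0.01],
]
curve = [completeness_score(p) for p in prefix_distributions]
start = first_start_index(curve, threshold=0.8)
```

Under these assumptions, raising the threshold delays the start of reasoning (lower risk of reasoning on an incomplete question, higher latency), and lowering it does the opposite, which is one way to picture the accuracy-latency control described above.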

Paper Structure

This paper contains 24 sections, 7 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Training token sequence arrangement. We train the model to interleave reasoning tokens $\mathcal{R}^{\mathrm{T}}$ with streaming ASR tokens $\mathcal{Q}^{\mathrm{T}}$ on the text monologue channel, with special switch tokens for mode switching. After the CoT ends, the model generates text tokens which align with the spoken response $\mathcal{R}^{\mathrm{T}}$. For simplicity, [PAD] and [EPAD] tokens are not shown here.
  • Figure 2: Examples of the Question Completeness curve $\zeta\left(p\right)$. In the first example, $\zeta$ reaches a high value at the end of the main question, at which point it is feasible to begin reasoning. In the second example, the word "Backcountry?" is critical to answer the question, and this is reflected in the corresponding $\zeta$ curve. More examples of the $\zeta$ curve are provided in Appendix \ref{sec:qc_metric}.
  • Figure 3: The framework for curating preference data for DPO. We generate outputs from the SFT model ($\pi_{\mathrm{ref}}$) by force-decoding <start_cot> early (e.g., before "on which river" is spoken). The preferred response ($y_w$) is the one where the model is able to adaptively generate a correct and shorter reasoning trace.
  • Figure 4: Effect of streaming user ASR on accuracy for SRQA tasks. As we increase look-ahead, the accuracy improves and approaches the "offline ASR" topline.
  • Figure 5: Accuracy-latency curves for the proposed methods on SRQA reasoning tasks. QC exhibits better controllability in trade-offs. DPO training with correctness-based preference further improves the accuracy of the QC models.
  • ...and 1 more figure
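
The preference-data rule in the Figure 3 caption (prefer a response that is correct and has a shorter reasoning trace) can be sketched as a small selection routine, followed by the standard DPO loss for a single pair. The `Sample` fields, the choice of an incorrect sample as the rejected response, the tie-breaking by trace length, and the value of `beta` are assumptions for illustration; the caption does not specify how the dispreferred response is chosen.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    """One sampled response; all fields are hypothetical bookkeeping."""
    cot_tokens: int      # length of the reasoning trace
    correct: bool        # whether the final answer matches the reference
    logp_policy: float   # sequence log-probability under the trained policy
    logp_ref: float      # sequence log-probability under the frozen SFT model


def pick_pair(samples: List[Sample]) -> Optional[Tuple[Sample, Sample]]:
    """Return (preferred, rejected), or None if no valid pair exists.

    Preferred: the correct sample with the shortest reasoning trace.
    Rejected (assumption): the incorrect sample with the longest trace.
    """
    correct = [s for s in samples if s.correct]
    wrong = [s for s in samples if not s.correct]
    if not correct or not wrong:
        return None
    y_w = min(correct, key=lambda s: s.cot_tokens)
    y_l = max(wrong, key=lambda s: s.cot_tokens)
    return y_w, y_l


def dpo_loss(y_w: Sample, y_l: Sample, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = (y_w.logp_policy - y_w.logp_ref) - (y_l.logp_policy - y_l.logp_ref)
    z = -beta * margin
    # Numerically stable softplus(z), which equals -log(sigmoid(-z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy samples with made-up lengths and log-probabilities.
samples = [
    Sample(cot_tokens=40, correct=True, logp_policy=-12.0, logp_ref=-13.0),
    Sample(cot_tokens=25, correct=True, logp_policy=-10.0, logp_ref=-12.0),
    Sample(cot_tokens=60, correct=False, logp_policy=-15.0, logp_ref=-14.0),
]
pair = pick_pair(samples)
loss = dpo_loss(*pair) if pair is not None else None
```

In this toy setup the loss falls below log 2 because the policy already favors the preferred response more than the reference model does; prompts with no correct or no incorrect sample yield no pair and are simply skipped.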