
Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM

Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu

TL;DR

This work proposes a method that implicitly internalizes the ASR chain of thought into a speech LLM, which reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.

Abstract

Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, excelling in tasks related to these domains. However, their ability to handle direct speech-to-speech conversations remains notably constrained. These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions. We also release a large-scale synthetic conversational dataset to facilitate further research.

Paper Structure (24 sections, 2 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Illustration of ICoT training and generation structure (from A-T-T-A to A-T-A ASR ICoT). Tokens of audio transcripts are removed linearly from the start during training, compressing the generation length for faster inference.
  • Figure 2: Winrate (percentage) of different models’ generated responses compared to the proposed A-T-A (ASR ICoT) model, as evaluated by Prometheus and GPT-4o. A higher percentage indicates that the model outperformed the proposed method more often. The dotted line marks a 50% winrate (draw). While our model did not surpass the slower A-T-T-A chain-of-thought method, it outperformed most baseline models significantly, particularly those with similar or lower latency.
  • Figure 3: Winrate (percentage) of different models’ generated responses compared to the ground truth, as evaluated by Prometheus and GPT-4o. A higher percentage indicates that the model outperformed the ground truth more often. The dotted line marks a 50% winrate (draw). The trends are consistent with those in Figure 2.
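
The Figure 1 caption describes transcript tokens being removed linearly from the start during training. The following is a minimal sketch of what such a schedule could look like, not the authors' implementation. The function names, the `step` / `total_steps` parameterization, and the assumption that the training target is the transcript followed by the response are all hypothetical and chosen only for illustration.

```python
from typing import List


def tokens_to_remove(step: int, total_steps: int, transcript_len: int) -> int:
    """Number of leading transcript tokens to drop at a given training step.

    Hypothetical linear schedule: nothing is removed at step 0, and the whole
    transcript is removed once step reaches total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the schedule keep the full removal.
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return min(transcript_len, int(fraction * transcript_len))


def build_target(
    transcript_tokens: List[str],
    response_tokens: List[str],
    step: int,
    total_steps: int,
) -> List[str]:
    """Training target with the first n transcript tokens removed.

    Assumes (for illustration only) that the target is the ASR transcript
    followed by the response tokens.
    """
    n = tokens_to_remove(step, total_steps, len(transcript_tokens))
    return transcript_tokens[n:] + response_tokens


if __name__ == "__main__":
    transcript = ["how", "are", "you", "today"]
    response = ["<resp>", "fine", "thanks"]
    for step in (0, 5, 10):
        print(step, build_target(transcript, response, step, total_steps=10))
```

Under these assumptions, the target starts as the full transcript plus response and shrinks until only the response remains, which matches the caption's description of a shorter generation length at inference time.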