KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang

TL;DR

The paper tackles the tension between low-latency, end-to-end speech-to-speech (S2S) models and knowledge-rich but high-latency cascaded systems. It introduces KAME, a tandem architecture that pairs a front-end S2S module with a back-end text LLM and uses streaming oracle tokens to infuse real-time responses with knowledge while keeping latency on par with baseline S2S models. A simulated oracle augmentation training regime provides realistic, time-varying guidance without requiring live back-end interactions. On a speech-synthesized MT-Bench variant, KAME substantially improves knowledge-driven accuracy relative to Moshi while preserving responsiveness, though it remains slightly behind fully cascaded systems because of the timing of oracle injections. The approach is back-end agnostic, so different LLMs can be integrated flexibly, pointing to a practical path toward knowledge-rich, low-latency conversational AI.

Abstract

Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.
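The timing idea in the abstract (the front end keeps responding while the back-end text arrives late and is injected as it streams in) can be illustrated with a toy simulation. This is only a sketch under strong assumptions: `ToyBackend`, `run_tandem`, `delay_steps`, and the filler tokens are hypothetical names invented for illustration, and the front end here simply speaks oracle tokens verbatim, whereas the paper's S2S model is guided by them rather than copying them.

```python
from dataclasses import dataclass


@dataclass
class ToyBackend:
    """Hypothetical stand-in for the back-end text LLM (not the paper's API)."""
    answer: list[str]
    delay_steps: int  # steps that pass before the first oracle token arrives

    def oracle_at(self, step: int) -> list[str]:
        # Once the delay has passed, one more oracle token is available per step.
        available = max(0, step - self.delay_steps + 1)
        return self.answer[:available]


def run_tandem(backend: ToyBackend, filler: list[str], num_steps: int) -> list[str]:
    """Emit one token per step; the front end never blocks on the back end."""
    output = []
    used = 0  # number of oracle tokens already spoken
    for step in range(num_steps):
        oracle = backend.oracle_at(step)
        if used < len(oracle):
            # Oracle guidance is available: follow it (assumption: verbatim copy).
            output.append(oracle[used])
            used += 1
        else:
            # No guidance yet: respond immediately with the front end's own token.
            output.append(filler[step % len(filler)])
    return output


backend = ToyBackend(answer=["Paris", "is", "the", "capital"], delay_steps=2)
print(run_tandem(backend, filler=["well,"], num_steps=6))
```

In this toy run the first two steps are filled by the front end alone, and the back-end answer takes over from the third step onward, which mirrors the claim that responsiveness is preserved while knowledge arrives slightly later than it would in the first token of a cascaded reply.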

Paper Structure

This paper contains 11 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Performance on MT-Bench vs. latency. KAME bridges the gap between low-latency, end-to-end S2S models (e.g., Moshi) and high-quality, cascaded systems (e.g., Unmute).
  • Figure 2: Proposed architecture for next-token generation.
  • Figure 3: An example of generating simulated oracle texts from a recorded conversation. As more of the user's input utterance (top) is revealed over time, the simulated oracle text (bottom) becomes progressively more accurate, and eventually converges to the recorded response (middle). The generation of the oracles is based on the hint level, which in turn depends on the completeness of the partial user speech.