Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents

Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, Shyamnath Gollakota

TL;DR

Synchronous LLMs (SyncLLM) turn static, text-based LLMs into full-duplex spoken dialogue agents by integrating wall-clock timing through periodic synchronization tokens and deduplicated HuBERT token sequences. A three-stage training pipeline, built on Llama3-8b with a 501-token HuBERT vocabulary, uses abundant synthetic speech derived from text dialogues plus a small real spoken-dialogue dataset to teach timing, backchannels, and overlaps. Empirical results show that SyncLLM outperforms the state-of-the-art dGSLM in dialogue meaningfulness while maintaining natural turn-taking, and that it tolerates latencies of up to $240~\mathrm{ms}$ in simulated full-duplex interactions, including LLM–LLM dialogues between models trained on different datasets. The approach enables latency-tolerant, streaming full-duplex voice interfaces that leverage large-scale text pretraining.

Abstract

Despite broad interest in modeling spoken dialogue agents, most approaches are inherently "half-duplex" -- restricted to turn-based interaction with responses requiring explicit prompting by the user or implicit tracking of interruption or silence events. Human dialogue, by contrast, is "full-duplex", allowing for rich synchronicity in the form of quick and dynamic turn-taking, overlapping speech, and backchanneling. Technically, the challenge of achieving full-duplex dialogue with LLMs lies in modeling synchrony, as pre-trained LLMs do not have a sense of "time". To bridge this gap, we propose Synchronous LLMs for full-duplex spoken dialogue modeling. We design a novel mechanism to integrate time information into Llama3-8b so that it runs synchronously with the real-world clock. We also introduce a training recipe that uses 212k hours of synthetic spoken dialogue data generated from text dialogue data to create a model that generates meaningful and natural spoken dialogue, with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform the state of the art in dialogue meaningfulness while maintaining naturalness. Finally, we demonstrate the model's ability to participate in full-duplex dialogue by simulating interaction between two agents trained on different datasets, while considering Internet-scale latencies of up to 240 ms. Webpage: https://syncllm.cs.washington.edu/.

Paper Structure (21 sections, 8 figures, 7 tables)

Figures (8)

  • Figure 1: SyncLLM as a full-duplex dialogue agent. At the current time step (chunk N in the figure), SyncLLM's context contains interleaved chunks of the LLM's speech up to the current chunk, and the user's speech corresponding to all but the current chunk. To stay in synchrony with the user, the LLM must generate its next chunk (chunk N+1) before the end of the current chunk. As a result, SyncLLM first generates an estimate of the user's current chunk, which is in turn appended to the context and used to predict its own next chunk.
  • Figure 2: SyncLLM's token sequence format visualized with a chunk size of 160 ms. (Top row) We represent spoken dialogue as interleaved chunks of HuBERT tokens, where the chunk size determines the frequency of the synchronization token [S0]. (Middle row) We train SyncLLM to generate interleaved chunks of deduplicated HuBERT tokens along with periodic synchronization tokens. (Bottom row) We interpolate deduplicated tokens in each chunk to obtain the spoken dialogue sequence in the original format.
  • Figure 3: Tokens required to represent one second of speech with and without deduplication. Histogram computed over 15 hours of dialogue data in the Fisher dataset (Cieri et al., 2004).
  • Figure 4: We sample speech percentages from a truncated normal distribution, so that we obtain samples with all possible combinations of text-speech interleaving throughout the training process, with a bias toward higher speech percentages as training progresses. This resulted in stable training when starting from a text-only LLM.
  • Figure 5: Perplexity of transcriptions of spoken dialogues generated by different models. Perplexity is measured with respect to a text dialogue model's predictions.
  • ...and 3 more figures
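
The token format described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes 4 HuBERT tokens per 160 ms chunk (a hypothetical rate not stated in the captions), a second speaker tag `[S1]` by analogy with `[S0]`, and a simple even-spread rule for interpolation. All function names are hypothetical.

```python
from itertools import groupby

# Assumption: 4 tokens per 160 ms chunk (40 ms per token). The captions give
# only the chunk duration, so this rate is illustrative.
TOKENS_PER_CHUNK = 4
# "[S0]" appears in the Figure 2 caption; "[S1]" is assumed for the other speaker.
SPEAKER_TAGS = ("[S0]", "[S1]")


def dedup_chunk(tokens):
    """Collapse runs of identical consecutive tokens within one chunk."""
    return [tok for tok, _ in groupby(tokens)]


def interpolate_chunk(dedup, size=TOKENS_PER_CHUNK):
    """Stretch deduplicated tokens back to `size` slots.

    Assumption: repeats are spread as evenly as possible. The original
    per-token durations are lost by deduplication, so this is approximate.
    If more than `size` tokens are given, the surplus is dropped.
    """
    if not dedup:
        raise ValueError("cannot interpolate an empty chunk")
    base, extra = divmod(size, len(dedup))
    out = []
    for i, tok in enumerate(dedup):
        out.extend([tok] * (base + (1 if i < extra else 0)))
    return out


def encode(speaker0, speaker1, size=TOKENS_PER_CHUNK):
    """Interleave deduplicated chunks of two token streams, each chunk
    preceded by its speaker's synchronization token."""
    seq = []
    usable = min(len(speaker0), len(speaker1))
    for start in range(0, usable - size + 1, size):
        for tag, stream in zip(SPEAKER_TAGS, (speaker0, speaker1)):
            seq.append(tag)
            seq.extend(dedup_chunk(stream[start:start + size]))
    return seq


def decode(seq, size=TOKENS_PER_CHUNK):
    """Split an interleaved sequence by speaker tag and interpolate every
    chunk back to a fixed number of tokens."""
    streams = {tag: [] for tag in SPEAKER_TAGS}
    tag, chunk = None, []
    for item in list(seq) + [None]:  # trailing None flushes the last chunk
        if item is None or item in streams:
            if tag is not None:
                streams[tag].extend(interpolate_chunk(chunk, size))
            tag, chunk = item, []
        else:
            chunk.append(item)
    return streams[SPEAKER_TAGS[0]], streams[SPEAKER_TAGS[1]]


if __name__ == "__main__":
    s0 = [7, 7, 7, 9, 9, 9, 3, 3]
    s1 = [5, 5, 5, 5, 5, 2, 2, 2]
    seq = encode(s0, s1)
    print(seq)
    print(decode(seq))
```

In this sketch the decoded streams have the same length as the inputs, so each chunk still covers a fixed span of wall-clock time, but the token boundaries inside a chunk may shift (for example, `[7, 7, 7, 9]` comes back as `[7, 7, 9, 9]`).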