Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts

Garrett Tanzer, Gustaf Ahdritz, Luke Melas-Kyriazi

TL;DR

Real-time interactive conversations with pretrained language models are hampered by traditional turn-taking. The authors propose modeling timed diarized transcripts and decoding with causal rejection sampling to synchronize generations with real-world time, validating the approach in instant messenger and spoken-conversation domains. They demonstrate feasibility across multiple model scales, report token-rate requirements and quality metrics, and release public code to reproduce the case studies. The work provides a scalable, data-efficient pathway to bring text-only LMs into real-time, streaming dialogue applications with potential impact on gaming, entertainment, and interactive agents.

Abstract

Chatbots built upon language models have exploded in popularity, but they have largely been limited to synchronous, turn-by-turn dialogues. In this paper we present a simple yet general method to simulate real-time interactive conversations using pretrained text-only language models, by modeling timed diarized transcripts and decoding them with causal rejection sampling. We demonstrate the promise of this method with two case studies: instant messenger dialogues and spoken conversations, which require generation at about 30 tok/s and 20 tok/s respectively to maintain real-time interactivity. These capabilities can be added into language models using relatively little data and run on commodity hardware.
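To make the decoding idea concrete, the sketch below shows one plausible reading of causal rejection sampling, not the authors' algorithm. The model proposes a timestamped message; if the user sends something before that timestamp, the proposal is discarded and a new one is sampled with the user's message in context. The names `decode_step`, `propose`, and `poll_user` are hypothetical, and the tuple-based message format is an assumption made for illustration only.

```python
from typing import Callable, List, Optional, Tuple

# Assumed message format: (timestamp in seconds, speaker, text).
Message = Tuple[float, str, str]


def decode_step(
    history: List[Message],
    propose: Callable[[List[Message]], Message],
    poll_user: Callable[[float, float], Optional[Message]],
    now: float,
    max_retries: int = 8,
) -> Tuple[List[Message], Optional[Message]]:
    """Propose a timed message and reject it if the user speaks first.

    `propose` stands in for sampling the next timed message from a language
    model. `poll_user(start, end)` stands in for checking whether a user
    message arrived in the interval (start, end]. Both are hypothetical.
    """
    history = list(history)  # do not mutate the caller's list
    for _ in range(max_retries):
        candidate = propose(history)
        # Did the user send anything before the candidate's planned time?
        interruption = poll_user(now, candidate[0])
        if interruption is None:
            # No interruption: the candidate is consistent with reality.
            history.append(candidate)
            return history, candidate
        # Reject the candidate, add the user's message, and resample.
        history.append(interruption)
        now = interruption[0]
    return history, None  # gave up after max_retries rejections
```

In a live system, `propose` would wrap model sampling and `poll_user` would read from an input queue; here both are left as callables so the control flow can be exercised with toy stand-ins.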

Paper Structure

This paper contains 22 sections, 23 figures, 1 table, 3 algorithms.

Figures (23)

  • Figure 1: Formatting for the instant messenger case study.
  • Figure 2: Formatting for the spoken conversation case study.
  • Figure 3: Statistics about the overhead of our control formats for instant messenger dialogues (top) and spoken conversations (bottom), and the requirements to maintain real-time interactivity. Left: Lengths (in Llama 2 tokens) of plaintext messages vs. control tokens for examples in the training set. Right: Fractions of the messages in the ground-truth dataset, including control tokens, that could be generated in real time for a given minimum generation rate, in tokens per second (again using the Llama 2 tokenizer). A message $m$ can be generated in real time if it can be generated in the time between the latest message outside of a short reaction window ($t_{react} = 200$ ms) immediately before $m$, and $m$ itself. (We assume that for small $n$, the increase in cost for passing $n$ tokens through the network in parallel vs. 1 token is negligible, i.e. we are primarily modeling the cost of generating system responses, not ingesting user inputs.) For spoken conversations, we include performance figures for an optimized tokenizer which uses a single token for 3-digit timestamps.
  • Figure 4: Conversations generated by fine-tuned language models exhibit realistic message timings. Top: Log-binned histogram of the delays (in seconds) between successive messages in 512 independent 1000-token conversations generated unconditionally by fine-tuned Llama 2 7B (temperature 1, nucleus sampling with top-p = 0.95), compared to delays in a corresponding chunk of consecutive ground-truth messages of the same size sampled at random from the same month and year as the simulated ones. Mean conversation length is 73 messages. The empirical distributions are very similar (25-bin Kullback–Leibler divergence = 0.005), attributable to nucleus sampling. Bottom: Consecutive message delays for continuations of three randomly selected message history prefixes, ground truth (dotted) vs. predicted (solid). We do not expect these to perfectly match due to irreducible entropy, but the resemblance in trajectory shows that the model is not just learning first-order statistics.
  • Figure 5: Ground truth instant messenger example.
  • ...and 18 more figures
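
The real-time criterion in the Figure 3 caption can be made concrete with a short calculation. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: messages are `(timestamp_seconds, token_count)` pairs sorted by time, the "latest message outside the reaction window" is taken over all speakers, and messages with no such predecessor are skipped. The function names `required_rates` and `realtime_fraction` are hypothetical.

```python
from typing import List, Tuple

T_REACT = 0.2  # reaction window in seconds (200 ms, from the Figure 3 caption)


def required_rates(
    messages: List[Tuple[float, int]], t_react: float = T_REACT
) -> List[float]:
    """Minimum tokens/second needed to emit each message on time.

    `messages` is assumed sorted by timestamp. For each message m, the
    time budget runs from the latest earlier message that falls outside
    the reaction window before m, up to m's own timestamp.
    """
    rates = []
    for i, (t, n_tokens) in enumerate(messages):
        anchor = None
        for t_prev, _ in reversed(messages[:i]):
            if t_prev <= t - t_react:
                anchor = t_prev
                break
        if anchor is None:
            continue  # assumption: skip messages with no usable predecessor
        # t - anchor >= t_react > 0, so the division is safe.
        rates.append(n_tokens / (t - anchor))
    return rates


def realtime_fraction(
    messages: List[Tuple[float, int]], rate: float, t_react: float = T_REACT
) -> float:
    """Fraction of messages that a model generating at `rate` tok/s can keep up with."""
    rates = required_rates(messages, t_react)
    if not rates:
        return 0.0
    return sum(r <= rate for r in rates) / len(rates)
```

Sweeping `rate` over a range of values and plotting `realtime_fraction` would give a curve of the kind described for the right-hand panels, assuming token counts include the control tokens.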