ConvFill: Model Collaboration for Responsive Conversational Voice Agents

Vidya Srinivas, Zachary Englhardt, Maximus Powers, Shwetak Patel, Vikram Iyer

TL;DR

ConvFill introduces conversational infill, a hybrid on-device/off-device architecture in which a lightweight model responds immediately while a backend LLM streams knowledge chunks that improve its responses. The approach decouples latency from capability, achieving sub-200 ms time to first token (TTFT) and notable QA gains on NaturalQuestions (46–52% accuracy), though still below backend performance (69–80%). It relies on a synthetic, multi-domain training corpus and a two-thread inference pipeline, with a streaming knowledge queue and a filler mechanism to hide latency. This work demonstrates a practical path toward responsive, knowledgeable on-device conversational agents and highlights future directions for grounding and larger on-device models.
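The two-thread pipeline mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `backend_worker`, `small_model_phrase`, and `respond`, the sentinel value, and the filler text are all hypothetical, and the backend LLM and on-device model are replaced by simple stand-ins. It only shows the control flow of a background thread streaming knowledge chunks into a queue while the foreground loop speaks immediately and folds chunks in as they arrive.

```python
import queue
import threading
import time

END_OF_STREAM = None  # hypothetical marker: the backend has no more chunks


def backend_worker(chunks, out_q, delay_s):
    """Stand-in for the backend LLM: emits knowledge chunks with latency."""
    for chunk in chunks:
        time.sleep(delay_s)  # simulated network and generation delay
        out_q.put(chunk)
    out_q.put(END_OF_STREAM)


def small_model_phrase(user_utterance, chunk):
    """Stand-in for the on-device model: wraps a chunk in a spoken phrase."""
    return f"So, {chunk}"


def respond(user_utterance, chunks, delay_s=0.05, poll_s=0.01):
    """Foreground loop: answer at once, then incorporate streamed chunks."""
    knowledge_q = queue.Queue()
    worker = threading.Thread(
        target=backend_worker, args=(chunks, knowledge_q, delay_s), daemon=True
    )
    worker.start()

    # Assumption: one filler phrase is spoken before any knowledge arrives.
    phrases = ["Good question, let me think."]
    while True:
        try:
            chunk = knowledge_q.get(timeout=poll_s)
        except queue.Empty:
            continue  # a real system could emit another filler here
        if chunk is END_OF_STREAM:
            break
        phrases.append(small_model_phrase(user_utterance, chunk))
    worker.join()
    return phrases
```

Calling `respond("Where should I move?", ["rent is lower in the suburbs"])` returns the filler first and one phrase per chunk after it. The first phrase never waits on the backend, which is the sense in which latency is decoupled from capability.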

Abstract

Deploying conversational voice agents with large language models faces a critical challenge: cloud-based foundation models provide deep reasoning and domain knowledge but introduce latency that disrupts natural conversation, while on-device models respond immediately but lack sophistication. We propose conversational infill, a task where a lightweight on-device model generates contextually appropriate dialogue while seamlessly incorporating streaming knowledge from a powerful backend model. This approach decouples response latency from model capability, enabling systems that feel responsive while accessing the full power of large-scale models. We present ConvFill, a 360M-parameter model trained on synthetic multi-domain conversations. Evaluation across multiple backend models shows that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of 36–42% over standalone small models of the same size while consistently retaining sub-200 ms response latencies. Our results demonstrate the promise of this approach for building on-device conversational agents that are both immediately responsive and knowledgeable.

Paper Structure

This paper contains 13 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Conversational Infill. ConvFill operates at the turn level while the backend model operates at the conversation level. The backend model outputs either silence or a knowledge chunk (blue), and ConvFill combines these chunks with conversational infill to generate the response to the user (orange).
  • Figure 2: Infill generation format. During training, ConvFill sees the user utterance, previous streamed knowledge chunks, and previous conversational phrases in an interleaved manner. ConvFill is trained to predict the final utterance (red): its own last conversational phrase, conditioned on the last external knowledge chunk (blue) and the phrase history.
  • Figure 3: Conversational infill inference example. The user asks about moving to a new city. ConvFill generates the response in an interleaved, streaming manner, with each conversational phrase (purple) referencing an external knowledge chunk (blue).
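
The interleaved format described for Figure 2 can be sketched as a small helper that builds one training pair. This is an illustration only: the tag strings (`<user>`, `<knowledge>`, `<phrase>`), the function name `build_example`, and the chunk-before-phrase ordering are assumptions, not the paper's actual tokens or serialization. The context holds the user utterance followed by earlier knowledge chunks and phrases in alternation, ends with the latest knowledge chunk, and the target is the phrase conditioned on it.

```python
def build_example(user_utterance, chunks, phrases):
    """Return (context, target) for one hypothetical infill training example.

    chunks[i] is the knowledge chunk that phrases[i] was conditioned on.
    The tag names are illustrative placeholders, not the paper's format.
    """
    if not phrases or len(chunks) != len(phrases):
        raise ValueError("need one knowledge chunk per conversational phrase")

    parts = [f"<user>{user_utterance}</user>"]
    # History: earlier chunks and phrases, interleaved in arrival order.
    for chunk, phrase in zip(chunks[:-1], phrases[:-1]):
        parts.append(f"<knowledge>{chunk}</knowledge>")
        parts.append(f"<phrase>{phrase}</phrase>")
    # The latest chunk closes the context; its phrase is the prediction target.
    parts.append(f"<knowledge>{chunks[-1]}</knowledge>")
    return "".join(parts), phrases[-1]
```

With two chunks and two phrases, the context contains the first chunk and phrase as history plus the second chunk, and the target is the second phrase.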