CoDi: Conversational Distillation for Grounded Question Answering

Patrick Huber, Arash Einolghozati, Rylan Conway, Kanika Narang, Matt Smith, Waqar Nayyar, Adithya Sagar, Ahmed Aly, Akshat Shrivastava

TL;DR

CoDi is a data-centric approach to distilling conversational abilities into small language models (SLMs): it synthesizes large-scale, diverse, and steerable multi-turn conversations from black-box LLMs. Using conversational graphs, per-turn links, and explicit linguistic phenomena, CoDi generates training data that lets 1B-scale SLMs perform grounded question answering at a level near or above models trained on human multi-turn data, and outperform larger instruction-tuned baselines in zero-shot settings. The framework is evaluated on CoQA and QuAC, with intra-domain and web-scale (zero-shot) synthesis, and shows gains in recall/F1, per-turn coherence, and zero-shot summarization tasks. These findings suggest CoDi can reduce reliance on costly human annotation while enabling effective on-device conversational agents for grounded reasoning. Practical benefits include improved on-device QA, scalable data-generation pipelines, and a path toward more capable SLMs that do not need to memorize extensive world knowledge in their limited weights.

Abstract

Distilling conversational skills into Small Language Models (SLMs) with approximately 1 billion parameters presents significant challenges. Firstly, SLMs have limited capacity in their model parameters to learn extensive knowledge compared to larger models. Secondly, high-quality conversational datasets are often scarce, small, and domain-specific. Addressing these challenges, we introduce a novel data distillation framework named CoDi (short for Conversational Distillation, pronounced "Cody"), allowing us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner. Specifically, while our framework is task agnostic at its core, we explore and evaluate the potential of CoDi on the task of conversational grounded reasoning for question answering. This is a typical on-device scenario for specialist SLMs, allowing for open-domain model responses, without requiring the model to "memorize" world knowledge in its limited weights. Our evaluations show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics. Additionally, when using our framework to generate larger datasets from web data, our models surpass larger, instruction-tuned models in zero-shot conversational grounded reasoning tasks.
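
As a rough illustration of the recall/F1 style of metric mentioned in the TL;DR, the sketch below computes token-overlap precision, recall, and F1 between a predicted and a reference answer. It assumes the SQuAD-style token-overlap definition commonly used for CoQA and QuAC; the function name `token_scores` and the simple lowercase-and-split normalization are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter


def token_scores(prediction: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, f1) over lowercase whitespace tokens.

    Assumption: plain lowercase + whitespace splitting stands in for the
    benchmark-specific answer normalization, which this section does not give.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        score = float(pred_tokens == ref_tokens)
        return score, score, score

    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    print(token_scores("the cat sat", "the cat"))
```

Recall rewards covering the reference answer, while F1 also penalizes extra tokens, which is why the two are often reported together for free-form answers.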

Paper Structure (36 sections, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Conversational Graph Generation Example. Left: general conversational graph. Right: rolled-out version of a sampled graph at length n=4.
  • Figure 2: A single generation step to synthesize a new turn in the conversation.
  • Figure 3: Per-turn conversational link augmentation with prompt and (optional) seed data.
  • Figure 4: Example of linguistic phenomena used in the final turn prompt.
  • Figure 5: CoDi synthesized example. Context document taken from the CoQA training corpus.
  • ...and 2 more figures
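
The Figure 1 caption describes rolling out a sampled conversational graph to a fixed length (n=4). A minimal sketch of that idea follows, assuming the graph is a mapping from a turn type to the turn types allowed to follow it. The turn-type names, the `START` node, and the uniform sampling are all hypothetical; this section does not specify the paper's actual graph or sampling scheme.

```python
import random

# Hypothetical turn types and transitions, for illustration only.
GRAPH: dict[str, list[str]] = {
    "START": ["question"],
    "question": ["follow_up", "clarification", "topic_shift"],
    "follow_up": ["follow_up", "clarification", "topic_shift"],
    "clarification": ["follow_up", "topic_shift"],
    "topic_shift": ["follow_up", "clarification"],
}


def roll_out(graph: dict[str, list[str]], n: int, rng: random.Random) -> list[str]:
    """Sample a sequence of n turn types by walking the graph from START."""
    path: list[str] = []
    node = "START"
    for _ in range(n):
        # Uniform choice among allowed successors (an assumption).
        node = rng.choice(graph[node])
        path.append(node)
    return path


if __name__ == "__main__":
    # A rolled-out conversation plan of length n=4, as in the Figure 1 caption.
    print(roll_out(GRAPH, n=4, rng=random.Random(0)))
```

Each sampled path could then serve as a plan for per-turn prompts, with one generation step per node; passing an explicit `random.Random` keeps rollouts reproducible.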