Multi-Document Grounded Multi-Turn Synthetic Dialog Generation

Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian

TL;DR

Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn, document-grounded benchmark test sets.

Abstract

We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn, document-grounded benchmark test sets.
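
The retrieval-and-filter loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `Turn` record, and the toy query generator, retriever, answerer, and judge are all hypothetical stand-ins for the LLM and retriever components, and stopping the dialog at the first rejected turn is an assumption made here for simplicity.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    query: str
    answer: str
    documents: List[str]  # grounding passages used for this turn


def generate_dialog(
    seed_document: str,
    num_turns: int,
    generate_query: Callable[[List[str], List[Turn]], str],
    retrieve: Callable[[str], List[str]],
    generate_answer: Callable[[str, List[str]], str],
    judge: Callable[[str, str, List[str]], bool],
) -> List[Turn]:
    """Hypothetical multi-document dialog loop; every callable is a placeholder."""
    dialog: List[Turn] = []
    documents = [seed_document]  # the first query starts from a single document
    for _ in range(num_turns):
        query = generate_query(documents, dialog)
        documents = retrieve(query)  # refresh grounding after every user turn
        answer = generate_answer(query, documents)
        if not judge(query, answer, documents):
            break  # assumption: cut the dialog at the first rejected turn
        dialog.append(Turn(query, answer, documents))
    return dialog


# Toy stand-ins so the sketch runs without an LLM or a real retriever.
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
TOPICS = ["France", "Germany"]


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def toy_query(documents: List[str], dialog: List[Turn]) -> str:
    # Stands in for a taxonomy-driven, CoT-prompted query generator.
    return f"What is the capital of {TOPICS[len(dialog) % len(TOPICS)]}?"


def toy_retrieve(query: str, k: int = 1) -> List[str]:
    # Rank passages by word overlap with the query.
    return sorted(CORPUS, key=lambda doc: -len(tokens(query) & tokens(doc)))[:k]


def toy_answer(query: str, documents: List[str]) -> str:
    return documents[0]  # echo the top passage as the "answer"


def toy_judge(query: str, answer: str, documents: List[str]) -> bool:
    return answer in documents  # accept only answers found in the grounding


dialog = generate_dialog(CORPUS[0], 2, toy_query, toy_retrieve, toy_answer, toy_judge)
for turn in dialog:
    print(turn.query, "->", turn.answer)
```

In this toy run the grounding passage changes between the first and second turn, which is the behavior the retriever step is meant to mimic. A real pipeline would replace each toy function with an LLM or retriever call.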

Paper Structure

This paper contains 18 sections, 11 figures, 10 tables, 2 algorithms.

Figures (11)

  • Figure 1: Overview of the document-grounded multi-turn synthetic dialog generation pipeline. We distinguish two dialog styles: single-document grounded (light green boxes) and multi-document grounded (retrieval-augmented generation, pink boxes). Both styles share the same starting-turn query taxonomy (ST-QT), the CoT prompt for initial query (Query$_{1}$) generation, the multi-turn query taxonomy (MT-QT), and the CoT prompt for second-turn query (Query$_{2}$) generation. User queries and agent answers are generated by an LLM. Given the initial query generated from the same single document, single-document grounded dialog generation proceeds according to Algorithm \ref{alg:mrc-dialog-algorithm}, and multi-document grounded dialog generation according to Algorithm \ref{alg:rag-dialog-algorithm}, both in §\ref{sec:dialog-flow}. After generating multi-turn dialogs, we apply LLM-as-a-Judge turn by turn to filter out queries with incorrect answers. We use Mixtral-8x7b-instruct as our language model for both data generation and LLM-as-a-Judge.
  • Figure 2: Winrate of two sets of fine-tuned models on the test sets judged by human annotators.
  • Figure 3: Chain-of-Thought prompt used for generating direct questions.
  • Figure 5: Chain-of-Thought prompt used for generating aggregate questions.
  • Figure 7: Chain-of-Thought prompt used for generating conversations with follow-up questions.
  • ...and 6 more figures