Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations

Md Arafat Sultan; Jatin Ganhotra; Ramón Fernandez Astudillo

TL;DR

This work tackles closed-domain hallucinations in open-source LLMs when generating content-grounded multi-turn QA conversations. It introduces Structured Chain-of-Thought (SCoT) prompting, which models the task as a state machine with four states: $uu$ (user utterance generation), $ac$ (question answerability classification), $ss$ (answer sentence selection), and $au$ (agent utterance generation). The decomposition separates reading, reasoning, and generation, and allows each state to use its own prompts and, optionally, its own tools. Intrinsic evaluations show that the explicit reading-oriented states improve faithfulness to grounding documents by up to 16.8%. Extrinsic evaluations show that synthetic data generated with SCoT can train competitive QA agents and, in some settings, outperform target-domain gold data. The findings suggest that state-aware prompting of open-source LLMs, combined with synthetic data augmentation, can support reliable content-grounded conversational QA without heavy instruction tuning or reliance on memorized knowledge. The approach offers a data-efficient path to conversational QA systems across domains and illustrates how modular task decomposition can curb hallucinations in large language models.

Abstract

We introduce a structured chain-of-thought (SCoT) prompting approach to generating content-grounded multi-turn question-answer conversations using a pre-trained large language model (LLM). At the core of our proposal is a structured breakdown of the complex task into a number of states in a state machine, so that actions corresponding to various subtasks, e.g., content reading and utterance generation, can be executed in their own dedicated states. Each state leverages a unique set of resources including prompts and (optionally) additional tools to augment the generation process. Our experimental results show that SCoT prompting with designated states for hallucination mitigation increases agent faithfulness to grounding documents by up to 16.8%. When used as training data, our open-domain conversations synthesized from only 6 Wikipedia-based seed demonstrations train strong conversational QA agents; in out-of-domain evaluation, for example, we observe improvements of up to 13.9% over target domain gold data when the latter is augmented with our generated examples.
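The state-machine decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the transition order (uu, then ac, then ss only for answerable questions, then au) is one assumed path among the several the paper allows, and the names `Turn`, `generate_turn`, `render`, and `stub_llm` are hypothetical. The stub stands in for the few-shot prompted LLM calls made in each state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: (state name, prompt inputs) -> generated text.
LLM = Callable[[str, Dict[str, str]], str]


@dataclass
class Turn:
    user: str
    agent: str
    answerable: bool
    evidence: str = ""


def render(history: List[Turn]) -> str:
    """Flatten earlier turns into plain text for the prompt."""
    return "\n".join(f"User: {t.user}\nAgent: {t.agent}" for t in history)


def generate_turn(document: str, history: List[Turn], llm: LLM) -> Turn:
    """Generate one user-agent pair along an assumed uu -> ac -> [ss] -> au path."""
    inputs = {"document": document, "history": render(history)}

    # State uu: user utterance generation.
    question = llm("uu", inputs)
    inputs["question"] = question

    # State ac: question answerability classification.
    answerable = llm("ac", inputs).strip().lower() == "yes"

    # State ss: answer sentence selection, visited only if answerable.
    evidence = ""
    if answerable:
        evidence = llm("ss", inputs)
        inputs["evidence"] = evidence

    # State au: agent utterance generation, conditioned on the outputs so far.
    answer = llm("au", inputs)
    return Turn(question, answer, answerable, evidence)


def stub_llm(state: str, inputs: Dict[str, str]) -> str:
    """Toy stand-in for a prompted LLM; real states would use few-shot prompts."""
    if state == "uu":
        return "Who wrote the report?"
    if state == "ac":
        return "yes" if "wrote" in inputs["document"] else "no"
    if state == "ss":
        return inputs["document"].split(". ")[0]
    # au: answer from selected evidence, or acknowledge that none exists.
    return inputs.get("evidence", "Sorry, the document does not say.")


if __name__ == "__main__":
    first = generate_turn("Ada wrote the report. It was long.", [], stub_llm)
    second = generate_turn("The weather was mild.", [first], stub_llm)
    print(first)
    print(second)
```

Keeping each state behind the same callable interface is what would let a different prompt, model, or tool be swapped into a single state without touching the others.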

Paper Structure (20 sections, 7 figures, 7 tables)

This paper contains 20 sections, 7 figures, 7 tables.

Introduction
Preliminaries
Methods
Experiments
Intrinsic Evaluation and Analysis
Extrinsic Evaluation
Setup
Few-Shot Prompting
Supervised Fine-Tuning (SFT)
Related Work
Conclusion
Prompts
Prompts for the Remaining States
Details of ICL Demonstrations
SFT Details
...and 5 more sections

Figures (7)

Figure 1: A multi-turn QA conversation grounded in a document. If the document does not contain an answer to a user query, the agent acknowledges this in its response.
Figure 2: State machine for generating a single user-agent utterance pair within a multi-turn conversation (see Preliminaries). An action (incoming arrow label) is executed in every state by few-shot prompting an LLM (see Methods), and an output is generated (dotted arrows). One of multiple possible transitions then takes place (solid arrows), depending on the algorithm being run. A grounding document and a conversation history (not in the diagram) are present in all steps.
Figure 3: Prompts for states $au$ and $ss$. Left: Agent utterance generation ($au$) with a pre-trained LLM. Right: Answer sentence selection ($ss$) with an instruction-following LLM. This diagram only shows 1-shot prompts for brevity; we use more demonstrations in practice (see the Prompts appendix).
Figure 4: Prompt for a pre-trained LLM in state $uu$: user utterance generation (see Methods).
Figure 5: Prompt for an instruction-following LLM assistant in state $ac$: question answerability classification (see Methods).
...and 2 more figures
