Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations
Md Arafat Sultan, Jatin Ganhotra, Ramón Fernandez Astudillo
TL;DR
This work tackles closed-domain hallucination in open-source LLMs when generating content-grounded multi-turn QA conversations. It introduces Structured Chain-of-Thought (SCoT) prompting, which models the task as a state machine with four distinct states ($uu$, $ac$, $ss$, $au$) to modularize reading, reasoning, and generation, and to optionally leverage different tools in each state. Intrinsic evaluations show that explicit reading-oriented states improve faithfulness to grounding documents by up to 16.8%, while extrinsic evaluations show that synthetic data generated via SCoT can train competitive QA agents and even surpass target-domain gold data in some settings. The findings highlight the practical potential of combining state-aware prompting of open-source LLMs with synthetic data augmentation to enable reliable content-grounded conversational QA without heavy instruction tuning or reliance on memorized knowledge. The approach offers a data-efficient path to robust conversational QA systems across diverse domains and shows how modular task decomposition can curb hallucinations in large language models.
Abstract
We introduce a structured chain-of-thought (SCoT) prompting approach to generating content-grounded multi-turn question-answer conversations using a pre-trained large language model (LLM). At the core of our proposal is a structured breakdown of the complex task into a number of states in a state machine, so that actions corresponding to various subtasks, e.g., content reading and utterance generation, can be executed in their own dedicated states. Each state leverages a unique set of resources including prompts and (optionally) additional tools to augment the generation process. Our experimental results show that SCoT prompting with designated states for hallucination mitigation increases agent faithfulness to grounding documents by up to 16.8%. When used as training data, our open-domain conversations synthesized from only 6 Wikipedia-based seed demonstrations train strong conversational QA agents; in out-of-domain evaluation, for example, we observe improvements of up to 13.9% over target domain gold data when the latter is augmented with our generated examples.
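The state-machine decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the state names come from the TL;DR, but the reading of each state (user utterance, answerability check, evidence selection, agent utterance), the transition order, the prompt wording, the `Context` fields, and the `stub_llm` stand-in for a real model call are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical interface: (state name, prompt) -> generated text.
LLM = Callable[[str, str], str]

UNANSWERABLE_REPLY = "I cannot answer that from the document."


@dataclass
class Context:
    """Working memory shared by all states during one conversation."""
    document: str
    history: List[str] = field(default_factory=list)
    question: str = ""
    answerable: bool = False
    evidence: str = ""
    answer: str = ""


def uu(ctx: Context, llm: LLM) -> Optional[str]:
    # Assumed meaning: generate the next user utterance (a question).
    ctx.question = llm("uu", f"Document: {ctx.document}\nHistory: {ctx.history}\nQuestion:")
    return "ac"


def ac(ctx: Context, llm: LLM) -> Optional[str]:
    # Assumed meaning: decide whether the document can answer the question.
    verdict = llm("ac", f"Document: {ctx.document}\nQuestion: {ctx.question}\nAnswerable?")
    ctx.answerable = verdict.strip().lower().startswith("yes")
    return "ss" if ctx.answerable else "au"


def ss(ctx: Context, llm: LLM) -> Optional[str]:
    # Assumed meaning: select the supporting text before answering.
    ctx.evidence = llm("ss", f"Document: {ctx.document}\nQuestion: {ctx.question}\nEvidence:")
    return "au"


def au(ctx: Context, llm: LLM) -> Optional[str]:
    # Assumed meaning: write the agent utterance, grounded in the evidence only.
    if ctx.answerable:
        ctx.answer = llm("au", f"Evidence: {ctx.evidence}\nQuestion: {ctx.question}\nAnswer:")
    else:
        ctx.answer = UNANSWERABLE_REPLY
    return None  # end of turn


STATES: Dict[str, Callable[[Context, LLM], Optional[str]]] = {
    "uu": uu, "ac": ac, "ss": ss, "au": au,
}


def generate_turn(ctx: Context, llm: LLM) -> List[str]:
    """Run one question-answer turn and return the sequence of visited states."""
    state: Optional[str] = "uu"
    trace: List[str] = []
    while state is not None:
        trace.append(state)
        state = STATES[state](ctx, llm)
    ctx.history += [ctx.question, ctx.answer]
    return trace


def stub_llm(state: str, prompt: str) -> str:
    """Canned outputs standing in for a real model call."""
    canned = {
        "uu": "Who wrote the paper?",
        "ac": "yes",
        "ss": "The paper was written by three authors.",
        "au": "Three authors wrote it.",
    }
    return canned[state]


if __name__ == "__main__":
    ctx = Context(document="The paper was written by three authors.")
    print(generate_turn(ctx, stub_llm), ctx.answer)
```

Keeping each subtask in its own handler means a state can be given its own prompt or swapped for a different tool without touching the others, which is the modularity the abstract emphasizes.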
