Improving Consistency in Large Language Models through Chain of Guidance
Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar
TL;DR
This work tackles the problem of semantic inconsistency in large language models by introducing Chain of Guidance (CoG), a multi-step prompting framework that generates paraphrase-rich QA data and ranks candidate answers to enforce semantic alignment. CoG uses guided paraphrase generation, guided answer generation, and an in-context ranking step to produce expanded, consistent QA pairs, which are then used to fine-tune smaller models via LoRA or full SFT. Empirical results show substantial gains in semantic consistency across multiple LLMs, with up to 49% improvement on entailment-based metrics, and demonstrate generalization to unseen datasets; human studies corroborate the alignment between these metrics and human judgments. The approach maintains competitive performance on non-QA tasks and offers a practical, modular pathway to improving trustworthiness in LLM-based systems by combining synthetic data generation with targeted fine-tuning. The modular CoG framework also opens avenues for extending alignment objectives beyond consistency, such as fairness and safety, through task-specific prompt templates and evaluation metrics.
Abstract
Consistency is a fundamental dimension of trustworthiness in Large Language Models (LLMs). For humans to be able to trust LLM-based applications, their outputs should be consistent when prompted with inputs that carry the same meaning or intent. Despite this need, there is no known mechanism to control and guide LLMs to be more consistent at inference time. In this paper, we introduce a novel alignment strategy to maximize semantic consistency in LLM outputs. Our proposal is based on Chain of Guidance (CoG), a multi-step prompting technique that generates highly consistent outputs from LLMs. For closed-book question-answering (Q&A) tasks, outputs generated using CoG are more consistent than those generated by direct prompting. While other approaches like template-based responses and majority voting may offer alternative paths to consistency, our work focuses on exploring the potential of guided prompting. We use synthetic datasets composed of consistent input-output pairs to fine-tune LLMs to produce consistent and correct outputs. Our fine-tuned models are more than twice as consistent as base models and show strong generalization capabilities by producing consistent outputs over datasets not used in the fine-tuning process.
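To make the multi-step structure concrete, the following is a minimal sketch of a CoG-style pipeline: paraphrase the question, answer every variant, then ask the model to pick one answer that is paired with all variants. It is one plausible reading of the description above, not the authors' implementation. The `LLM` callable interface, the prompt wording, the `chain_of_guidance` function, and the `fake_llm` stub are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


def chain_of_guidance(question: str, llm: LLM, n_paraphrases: int = 3) -> Dict:
    """Sketch of a three-step guided prompting loop (prompts are assumptions)."""
    # Step 1: guided paraphrase generation.
    paraphrases: List[str] = [
        llm(f"Paraphrase the question (variant {i + 1}), keeping its meaning:\n{question}")
        for i in range(n_paraphrases)
    ]
    questions = [question] + paraphrases

    # Step 2: guided answer generation, one candidate answer per question variant.
    answers = [llm(f"Answer concisely:\n{q}") for q in questions]

    # Step 3: in-context ranking, asking the model to select a single candidate.
    options = "\n".join(f"{i}. {a}" for i, a in enumerate(answers))
    choice = llm(
        f"Question: {question}\nCandidate answers:\n{options}\n"
        "Reply with the number of the best answer."
    )
    try:
        idx = int(choice.strip())
    except ValueError:
        idx = 0  # fall back to the first candidate if the reply is not a number
    if not 0 <= idx < len(answers):
        idx = 0
    best = answers[idx]

    # Pair every variant with the selected answer to form consistent QA pairs,
    # which could then serve as synthetic fine-tuning data.
    return {
        "question": question,
        "paraphrases": paraphrases,
        "answer": best,
        "pairs": [(q, best) for q in questions],
    }


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only to exercise the sketch."""
    if prompt.startswith("Paraphrase"):
        return "What is the capital city of France?"
    if prompt.startswith("Answer"):
        return "Paris"
    return "0"


result = chain_of_guidance("What is the capital of France?", fake_llm, n_paraphrases=2)
```

In practice `fake_llm` would be replaced by a call to an actual model, and the resulting `pairs` would be collected across many questions before fine-tuning.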
