DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

Wendi Cui, Jiaxin Zhang, Zhuohang Li, Lopez Damien, Kamalika Das, Bradley Malin, Sricharan Kumar

TL;DR

The paper introduces DCR, a Divide-Conquer-Reasoning framework for evaluating and improving the consistency of large language model outputs. It decomposes paragraph-level comparisons into sentence-level assessments via a Divide-Conquer Evaluator (DCE), converts reasons into a numeric score with an Auto-Metric Converter (AMC), and uses a Reason-Assisted Improver (RAI) to generate corrected sentences, enabling multi-round improvements. Across semantic, factual, and summarization tasks, DCE-AMC achieves higher alignment with human judgments than state-of-the-art baselines, with notable gains on SummEval (+19.3% and +24.3% vs G-Eval-4) and substantial reductions in inconsistencies (up to ~90% with RAI). The work demonstrates the practicality of sentence-level evaluation and reasoning-based remediation, offering a scalable approach to mitigating hallucinations and enhancing reliability in high-stakes NLG applications, while acknowledging limitations such as task-specific prompting and reference dependence.

Abstract

Evaluating the quality and variability of text generated by Large Language Models (LLMs) poses a significant yet unresolved research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, which measure token similarity, often fail to capture holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance, where reliability, safety, and robust decision-making are critical. This work proposes DCR, an automated framework for evaluating and improving the consistency of LLM-generated texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (DCE) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated based on predefined criteria. To facilitate this approach, we introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score. Beyond consistency evaluation, we further present a reason-assisted improver (RAI) that leverages the reasons and explanations identified by DCE to generate new responses aimed at reducing these inconsistencies. Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the consistency of LLM generation across multiple benchmarks in semantic, factual, and summarization consistency tasks. Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
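The sentence-to-paragraph decomposition and the conversion to a numeric score described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Verdict`, `divide_conquer_evaluate`, `to_score`, and `toy_judge` are hypothetical, the regex sentence splitter is a simplification, and the judge is a toy word-overlap check standing in for an LLM call. The scoring rule (+1 for a consistent sentence, -1 for an inconsistent one, averaged and rescaled to [0, 1]) is an assumption chosen for illustration; this section only states that AMC produces an interpretable numeric score.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    sentence: str
    consistent: bool
    reason: str


# Hypothetical judge interface: (sentence, reference paragraph) -> (consistent?, reason).
# In practice this would wrap an LLM prompt; here it is left pluggable.
Judge = Callable[[str, str], Tuple[bool, str]]


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def divide_conquer_evaluate(candidate: str, reference: str, judge: Judge) -> List[Verdict]:
    """Compare each candidate sentence against the whole reference paragraph."""
    return [Verdict(s, *judge(s, reference)) for s in split_sentences(candidate)]


def to_score(verdicts: List[Verdict]) -> float:
    """Assumed scoring rule: mean of +1/-1 votes, rescaled from [-1, 1] to [0, 1]."""
    if not verdicts:
        raise ValueError("no sentences to score")
    votes = [1 if v.consistent else -1 for v in verdicts]
    return (sum(votes) / len(votes) + 1) / 2


def toy_judge(sentence: str, reference: str) -> Tuple[bool, str]:
    """Stand-in for an LLM judge: flags any word that is absent from the reference."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    ref_words = set(re.findall(r"[a-z0-9]+", reference.lower()))
    missing = sorted(words - ref_words)
    if missing:
        return False, f"terms not found in reference: {missing}"
    return True, "all terms appear in the reference"


if __name__ == "__main__":
    reference = "The cat sat on the mat. It was sunny."
    candidate = "The cat sat on the mat. It was raining."
    verdicts = divide_conquer_evaluate(candidate, reference, toy_judge)
    for v in verdicts:
        print(v.consistent, "|", v.sentence, "|", v.reason)
    print("score:", to_score(verdicts))  # one consistent, one inconsistent -> 0.5
```

Keeping the per-sentence reasons alongside the verdicts is what would let a later improvement step target only the sentences flagged as inconsistent, rather than regenerating the whole paragraph.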

Paper Structure

This paper contains 32 sections, 5 equations, 5 figures, 18 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) Overview of the proposed DCR framework. The first two components (DCE-AMC) aim at providing a better strategy for evaluating and quantifying semantic consistency to best match human judgments. Building on this, a third component RAI further utilizes analytical reasoning to iteratively improve the consistency of LLM-generated content w.r.t. the reference by minimizing hallucinations. (b) The combination of DCE and AMC (DCE-AMC-4) significantly outperforms the baseline methods in terms of correlations with human ratings. (c) RAI substantially reduces output inconsistencies by $\sim90\%$ through a single improvement iteration on SummEval and QAGS benchmarks.
  • Figure 2: An example of evaluating and improving the consistency of generated text via DCR.
  • Figure 3: F1 score, precision, and recall performance of our method on sentence-level and paragraph-level evaluations. We also verify the effectiveness of the auto-metric converter.
  • Figure 4: Multi-round consistency improvement.
  • Figure 5: Computational cost.