CER: Confidence Enhanced Reasoning in LLMs

Ali Razghandi, Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah

TL;DR

This work tackles the reliability of large language models in complex, multi-step reasoning and knowledge-intensive generation. It introduces Confidence Enhanced Reasoning (CER), an uncertainty-aware framework that measures confidence on critical intermediate tokens, aggregates these step-level confidences into a path-level confidence with a weighted scheme, and selects the final answer by weighting each path by its overall reliability rather than by simple self-consistency voting. CER is evaluated on five datasets with four LLMs and improves accuracy by up to $7.4\%$ on math tasks and $5.8\%$ on open-domain tasks without any model fine-tuning. Ablations examine the contribution of intermediate signals, the number of sampled paths, and the choice of aggregation function. The approach is a lightweight, plug-in improvement to reasoning reliability that can be applied across model families and task domains.

Abstract

Ensuring the reliability of Large Language Models (LLMs) in complex reasoning tasks remains a formidable challenge, particularly in scenarios that demand precise mathematical calculations and knowledge-intensive open-domain generation. In this work, we introduce an uncertainty-aware framework designed to enhance the accuracy of LLM responses by systematically incorporating model confidence at critical decision points. We propose an approach that encourages multi-step reasoning in LLMs and quantifies the confidence of intermediate answers, such as numerical results in mathematical reasoning and proper nouns in open-domain generation. Then, the overall confidence of each reasoning chain is evaluated based on the confidence of these critical intermediate steps. Finally, we aggregate the answers of the generated response paths in a way that reflects the reliability of each path (as opposed to self-consistency, in which each generated chain contributes equally to majority voting). We conducted extensive experiments on five datasets, three mathematical and two open-domain, using four LLMs. The results consistently validate the effectiveness of our novel confidence aggregation method, leading to an accuracy improvement of up to 7.4% and 5.8% over baseline approaches in math and open-domain generation tasks, respectively. Code is publicly available at https://github.com/Aquasar11/CER.
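
To make the contrast with self-consistency concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each sampled path already has a scalar confidence, and that answers are combined by summing confidence per distinct answer, which is one plausible reading of the abstract. The function names and the example numbers are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Self-consistency baseline: every path contributes one equal vote."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, confidences):
    """Assumed aggregation: sum path confidences per distinct answer
    and return the answer with the largest total."""
    scores = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        scores[answer] += conf
    return max(scores, key=scores.get)


# Hypothetical example: three sampled paths, two of them low-confidence.
answers = ["120", "120", "125"]
confidences = [0.20, 0.25, 0.90]

print(majority_vote(answers))                          # 120
print(confidence_weighted_vote(answers, confidences))  # 125
```

In this invented example the two low-confidence paths win the majority vote, while the single high-confidence path wins once votes are weighted by confidence.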

Paper Structure

This paper contains 39 sections, 6 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of Confidence-Enhanced Reasoning (CER) in LLMs. On the left, we demonstrate the CER framework. Given an input query, the LLM generates three independent outputs using temperature sampling ($T = 1$). Intermediate answers are bolded, and final answers are highlighted. The confidence of each output is computed, and the answer with the highest weighted confidence, 125, is selected. On the right, we illustrate the confidence calculation for the first output. We use multiplication as the step-wise aggregator function ($f$) and weighted averaging ($wa$) as the path-wise aggregator function ($g$). Since the answer 125 appears in both step 4 and the final answer, we mark its first occurrence with * for clarity. The full question and responses from the LLM are provided in Appendix F.
  • Figure 2: Performance comparison of CER and baseline methods for different numbers of generated paths, $K \in \{3, 5, 10\}$, on the LLaMA 3.3-3B model using the MATH dataset.
  • Figure 3: Ablation study results comparing the performance of the CER method using the last answer confidence (CER-LAST, red) versus the original CER method utilizing all intermediate answers (CER-ALL, blue) across mathematical reasoning datasets (GSM8K, MATH, MathQA) and open-domain question-answering datasets (TriviaQA, HotpotQA). The left side presents results for LLaMA-3.1-8B, while the right side shows results for Mistral-7B. Across all datasets, CER-ALL consistently outperforms CER-LAST, emphasizing the advantage of incorporating intermediate answers for improved accuracy.
  • Figure 4: Prompt for Math Reasoning
  • Figure 5: Prompt for Multi-hop Reasoning
  • ...and 2 more figures
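
The confidence calculation described in the Figure 1 caption (multiplication as the step-wise aggregator $f$, weighted averaging as the path-wise aggregator $g$) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the token probabilities are taken as given, the default weights that grow linearly with step position are an assumption, and all names and numbers are hypothetical.

```python
import math


def step_confidence(token_probs):
    """f: confidence of one intermediate answer, taken here as the
    product of the probabilities of its tokens."""
    return math.prod(token_probs)


def path_confidence(step_confs, weights=None):
    """g: weighted average of the step confidences along one path.

    Assumption: when no weights are given, step i (1-based) gets weight i,
    so later steps count more. The source does not specify the weights.
    """
    if not step_confs:
        raise ValueError("a path needs at least one intermediate answer")
    if weights is None:
        weights = range(1, len(step_confs) + 1)
    weights = list(weights)
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, step_confs)) / total


# Hypothetical path with three intermediate answers, each a list of
# per-token probabilities.
steps = [[0.9, 0.8], [0.5], [1.0, 0.6]]
step_confs = [step_confidence(p) for p in steps]  # about [0.72, 0.5, 0.6]
print(path_confidence(step_confs))                # about 0.587
```

The resulting path confidence is the kind of per-path score that a confidence-weighted vote over final answers would consume.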