CER: Confidence Enhanced Reasoning in LLMs
Ali Razghandi, Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah
TL;DR
This work tackles the reliability of large language models in complex, multi-step reasoning and knowledge-intensive generation. It introduces Confidence Enhanced Reasoning (CER), an uncertainty-aware framework that measures confidence on critical intermediate tokens, combines these into a path-level confidence with a weighted scheme, and ensembles final answers by overall path reliability rather than by simple self-consistency voting. CER is evaluated across five datasets and four LLMs, showing accuracy gains of up to $7.4\%$ on math tasks and $5.8\%$ on open-domain tasks without any model fine-tuning; extensive ablations validate the importance of intermediate signals, path count, and aggregation choices. The approach offers a lightweight, plug-in improvement to reasoning reliability that can be applied to diverse model families and task domains, potentially benefiting AI systems that rely on robust, verifiable reasoning from LLMs.
Abstract
Ensuring the reliability of Large Language Models (LLMs) in complex reasoning tasks remains a formidable challenge, particularly in scenarios that demand precise mathematical calculations and knowledge-intensive open-domain generation. In this work, we introduce an uncertainty-aware framework designed to enhance the accuracy of LLM responses by systematically incorporating model confidence at critical decision points. We propose an approach that encourages multi-step reasoning in LLMs and quantifies the confidence of intermediate answers, such as numerical results in mathematical reasoning and proper nouns in open-domain generation. The overall confidence of each reasoning chain is then evaluated based on the confidence of these critical intermediate steps. Finally, we aggregate the answers of the generated response paths in a way that reflects the reliability of each path (as opposed to self-consistency, in which each generated chain contributes equally to majority voting). We conducted extensive experiments on five datasets (three mathematical and two open-domain) using four LLMs. The results consistently validate the effectiveness of our novel confidence aggregation method, leading to accuracy improvements of up to 7.4% and 5.8% over baseline approaches in math and open-domain generation tasks, respectively. Code is publicly available at https://github.com/Aquasar11/CER.
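
The difference between confidence-weighted aggregation and plain self-consistency can be illustrated with a minimal Python sketch. It is not the released implementation. The abstract does not specify the exact confidence or aggregation functions, so the choices here are assumptions made only for illustration: step confidence is taken as the product of the token probabilities of an intermediate answer, path confidence as a weighted mean in which later steps weigh more, and the final answer as the one with the largest summed path confidence. All function names and the example numbers are hypothetical.

```python
from collections import defaultdict
from math import prod


def step_confidence(token_probs):
    """Confidence of one intermediate answer.

    Assumption: product of the probabilities of its tokens.
    """
    return prod(token_probs)


def path_confidence(step_confs):
    """Confidence of a whole reasoning path.

    Assumption: weighted mean with linearly increasing weights,
    so later intermediate answers count more than earlier ones.
    """
    if not step_confs:
        return 0.0
    weights = range(1, len(step_confs) + 1)
    return sum(w * c for w, c in zip(weights, step_confs)) / sum(weights)


def confidence_weighted_vote(paths):
    """Pick the final answer with the largest summed path confidence.

    `paths` is a list of (final_answer, steps), where `steps` holds one
    list of token probabilities per intermediate answer.
    """
    scores = defaultdict(float)
    for answer, steps in paths:
        scores[answer] += path_confidence([step_confidence(s) for s in steps])
    return max(scores, key=scores.get)


def majority_vote(paths):
    """Self-consistency baseline: every path contributes one equal vote."""
    counts = defaultdict(int)
    for answer, _ in paths:
        counts[answer] += 1
    return max(counts, key=counts.get)


# Hypothetical example: one confident path answers "42",
# two low-confidence paths answer "41".
example_paths = [
    ("42", [[0.9, 0.95], [0.98]]),
    ("41", [[0.5, 0.6], [0.4]]),
    ("41", [[0.6, 0.5], [0.45]]),
]

weighted_answer = confidence_weighted_vote(example_paths)
majority_answer = majority_vote(example_paths)
```

In this made-up example, majority voting selects "41" because two of the three paths agree on it, while the confidence-weighted vote selects "42" because the single path supporting it is far more confident than the two paths supporting "41" combined.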
