Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models

Zhangyue Yin; Qiushi Sun; Qipeng Guo; Zhiyuan Zeng; Xiaonan Li; Tianxiang Sun; Cheng Chang; Qinyuan Cheng; Ding Wang; Xiaofeng Mou; Xipeng Qiu; XuanJing Huang

TL;DR

This work addresses the limitation of majority-vote ensembling when correct reasoning chains are outnumbered by incorrect ones. It introduces AoR, a hierarchical Aggregation of Reasoning framework that evaluates reasoning chains via a two-phase local-scoring and global-evaluation process, augmented by dynamic sampling to adapt to task complexity. Empirical results across mathematical, commonsense, and symbolic tasks show AoR consistently outperforms strong baselines and achieves a higher performance ceiling, while reducing computational overhead. The approach demonstrates robust gains across diverse LLMs and prompts, highlighting the practical impact of reasoning-chain evaluation for reliable answer selection.

Abstract

Recent advancements in Chain-of-Thought prompting have facilitated significant breakthroughs for Large Language Models (LLMs) in complex reasoning tasks. Current research enhances the reasoning performance of LLMs by sampling multiple reasoning chains and ensembling them by answer frequency. However, this approach fails in scenarios where the correct answers are in the minority. We identify this as a primary factor constraining the reasoning capabilities of LLMs, a limitation that cannot be resolved solely based on the predicted answers. To address this shortcoming, we introduce a hierarchical reasoning aggregation framework AoR (Aggregation of Reasoning), which selects answers based on the evaluation of reasoning chains. Additionally, AoR incorporates dynamic sampling, adjusting the number of reasoning chains in accordance with the complexity of the task. Experimental results on a series of complex reasoning tasks show that AoR outperforms prominent ensemble methods. Further analysis reveals that AoR not only adapts to various LLMs but also achieves a superior performance ceiling when compared to current methods.
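
The failure mode described in the abstract can be illustrated with a minimal sketch. The sampled answers and the gold label below are hypothetical values chosen for illustration, not data from the paper; `majority_vote` is an assumed helper name.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical answers from 5 sampled reasoning chains (illustrative only).
sampled = ["A", "A", "B", "A", "C"]
gold = "B"  # assumed correct answer for this made-up example

voted = majority_vote(sampled)
correct_present = gold in sampled

print(f"voted={voted}, gold={gold}, correct answer was sampled: {correct_present}")
```

Here the correct answer is generated once but outvoted three to one, so no method that looks only at answer frequencies can recover it.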

Paper Structure (42 sections, 5 equations, 14 figures, 3 tables)

Introduction
Related work
Reasoning with Chain-of-Thought.
Ensemble of Multiple Reasoning Chains.
Evaluation Capability of LLMs.
Preliminary
Standard Prompting.
CoT Prompting.
Self-Consistency.
Methodology
Overview
Local-Scoring.
Global-Evaluation.
Dynamic Sampling
Experiment
...and 27 more sections

Figures (14)

Figure 1: An illustrative example from AQuA (Ling et al., 2017), with 5 reasoning chains generated through temperature sampling. Although the LLM is able to generate the correct answer, majority voting ultimately selects an incorrect answer because incorrect answers are more numerous.
Figure 2: Proportion of samples in which the correct answer appears among the LLM's generations, computed over the samples where majority voting yields an incorrect outcome, across various reasoning tasks.
Figure 3: An illustrative example detailing the AoR workflow. Initially, 10 reasoning chains are sampled. During the local-scoring phase, reasoning chains with identical answers are compared, and the high-quality chains $\mathcal{R}_1$, $\mathcal{R}_2$, $\mathcal{R}_3$, and $\mathcal{R}_8$ are selected for global evaluation. In the global-evaluation phase, $\mathcal{R}_2$ receives the highest score, but the score margin between $\mathcal{R}_2$ and $\mathcal{R}_3$ fails to surpass the threshold $\theta$.
Figure 4: Illustration of the dynamic sampling process, where solid circles represent reasoning chains and hollow circles their respective scores. Due to the minimal score difference between $\mathcal{R}_2$ and $\mathcal{R}_3$, three additional chains $\mathcal{R}_{10}$, $\mathcal{R}_{11}$, and $\mathcal{R}_{12}$ are sampled, yielding answers (A), (B), and (E). $\mathcal{R}_{10}$ and $\mathcal{R}_{11}$ are compared against chains with matching answers. $\mathcal{R}_{10}$ fails to outscore $\mathcal{R}_1$, while $\mathcal{R}_{11}$ surpasses $\mathcal{R}_8$, advancing to global evaluation. $\mathcal{R}_{12}$, introducing a new answer (E), exceeds the threshold $\epsilon$ and progresses. In the global evaluation, $\mathcal{R}_{11}$ outperforms the others, and because its score margin over $\mathcal{R}_{12}$ exceeds $\theta$, answer (B) is selected as the final decision.
Figure 5: Performance comparison of AoR and various strong baselines on commonsense reasoning and symbolic reasoning tasks.
...and 9 more figures
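
The workflow described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the three callables stand in for LLM calls, one representative is kept per answer group, and all names and default values (`select_answer`, `n_initial`, `batch`, `n_max`, and the numeric settings of `epsilon` and `theta`) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

Chain = Tuple[str, str]  # (reasoning text, final answer)


def select_answer(
    sample_chains: Callable[[int], List[Chain]],  # assumed: draws n new chains
    local_score: Callable[[Chain], float],        # assumed: score within an answer group
    global_score: Callable[[Chain], float],       # assumed: score across answer groups
    n_initial: int = 10,
    batch: int = 3,
    n_max: int = 20,
    epsilon: float = 5.0,  # minimum local score to reach global evaluation
    theta: float = 2.0,    # required margin between the top two global scores
) -> str:
    chains = sample_chains(n_initial)
    while True:
        # Local scoring: compare chains that share an answer.
        groups = defaultdict(list)
        for chain in chains:
            groups[chain[1]].append(chain)
        best_per_group = []
        for members in groups.values():
            scored = [(local_score(c), c) for c in members]
            best_per_group.append(max(scored, key=lambda t: t[0]))
        reps = [c for s, c in best_per_group if s >= epsilon]
        if not reps:  # simplification: fall back to every group's best chain
            reps = [c for _, c in best_per_group]

        # Global evaluation: rank the representatives of different answers.
        ranked = sorted(((global_score(c), c) for c in reps),
                        key=lambda t: t[0], reverse=True)
        margin = ranked[0][0] - ranked[1][0] if len(ranked) > 1 else float("inf")
        if margin >= theta or len(chains) >= n_max:
            return ranked[0][1][1]

        # Dynamic sampling: the margin is too small, so draw more chains.
        new_chains = sample_chains(batch)
        if not new_chains:
            return ranked[0][1][1]
        chains.extend(new_chains)


# Toy usage with fixed, made-up scores in place of LLM judgments.
pool = iter([("r1", "A"), ("r2", "A"), ("r3", "B"), ("r4", "C"), ("r5", "B")])
scores = {"r1": 6.0, "r2": 4.0, "r3": 7.0, "r4": 3.0, "r5": 9.0}
result = select_answer(
    sample_chains=lambda n: [next(pool) for _ in range(n)],
    local_score=lambda c: scores[c[0]],
    global_score=lambda c: scores[c[0]],
    n_initial=4, batch=1, n_max=5,
)
print(result)
```

In the toy run, the first round leaves the representatives of (A) and (B) only one point apart, which is below `theta`, so one more chain is sampled; the new chain for (B) then wins by a margin of three and (B) is returned.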
