Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting

Tim Knappe, Ryan Li, Ayush Chauhan, Kaylee Chhua, Kevin Zhu, Sean O'Brien

TL;DR

This work extends self-consistency for language models by weighting reasoning paths semantically, using embedding-based centroids and cosine-consensus measures, and by explicitly filtering outliers. By separating semantic analysis from the final majority vote, the approach improves robustness on complex reasoning tasks across arithmetic and commonsense benchmarks. Empirical results show that Semantic Consensus Weighting often outperforms both Centroid Proximity Weighting and baseline self-consistency, and that outlier detectors improve accuracy further, though their effectiveness varies across models and datasets. The framework provides a practical, scalable method to diagnose and improve reasoning quality in open-world tasks, while highlighting the importance of featurizer quality and hyperparameter tuning. Overall, semantic self-consistency offers a principled way to use the semantic information in reasoning traces to make large language models reason more reliably.

Abstract

While large language models (LLMs) have rapidly improved their performance across a broad range of tasks, they still often fall short on reasoning tasks. As LLMs become more integrated into diverse real-world tasks, advancing their reasoning capabilities is crucial to their effectiveness on nuanced, complex problems. Wang et al.'s self-consistency framework shows that sampling multiple rationales before taking a majority vote reliably improves model performance across various closed-answer reasoning tasks. Standard methods based on this framework aggregate the final decisions of these rationales but fail to use the semantic information contained in the step-by-step reasoning paths. Our work introduces semantic self-consistency, which enhances this approach by analyzing the reasoning paths of these rationales in addition to their final decisions before taking a majority vote. These methods not only improve the reliability of reasoning paths but also yield more robust performance on complex reasoning tasks.
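
To make the contrast concrete, here is a minimal sketch of the two aggregation schemes. It is an illustration under stated assumptions, not the authors' implementation. The exact weighting formula is not given in this summary, so the sketch assumes each rationale's vote is weighted by its mean cosine similarity to the other sampled rationales. The function names (`majority_vote`, `consensus_weighted_vote`) are hypothetical, and the small hand-written vectors stand in for embeddings that a real sentence featurizer would produce.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def majority_vote(answers):
    """Baseline self-consistency: every sampled rationale gets one vote."""
    counts = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def consensus_weighted_vote(answers, embeddings):
    """Assumed consensus weighting: a rationale's vote counts in proportion
    to its mean cosine similarity to all other sampled rationales.
    Negative similarities are not clipped in this sketch."""
    scores = defaultdict(float)
    for i, (answer, emb) in enumerate(zip(answers, embeddings)):
        sims = [cosine(emb, other) for j, other in enumerate(embeddings) if j != i]
        weight = sum(sims) / len(sims) if sims else 1.0
        scores[answer] += weight
    return max(scores, key=scores.get)


# Toy data (hypothetical): five sampled rationales with their final answers.
# The two "18" rationales reason almost identically; the three "20"
# rationales share an answer but have unrelated reasoning paths.
answers = ["18", "18", "20", "20", "20"]
embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(majority_vote(answers))                        # prints "20"
print(consensus_weighted_vote(answers, embeddings))  # prints "18"
```

On this toy input the plain majority vote picks "20", while the weighted vote picks "18" because the two rationales behind it agree semantically. The data is constructed to show the mechanism and says nothing about how often the two schemes disagree in practice.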

Paper Structure

This paper contains 59 sections, 2 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Baseline self-consistency comprises three steps: (1) prompt a model with chain-of-thought, (2) generate n sampled sequences, and (3) choose the most frequent final output. Our proposed method, shown above, instead decides based on the semantic consistency of the reasoning paths employed. Our assumption is that language models often apply the correct reasoning but fail to reach the correct final result.
  • Figure 2: Average ROUGE-N scores across StrategyQA, AQuA-RAT, and SVAMP for different models
  • Figure 3: Average
  • Figure 4: Squared Average
  • Figure 5: t-SNE-reduced visualization from a test on a subset of arithmetic reasoning examples, evaluated on 10, 15, and 20 outputs generated with baseline self-consistency