Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting
Tim Knappe, Ryan Li, Ayush Chauhan, Kaylee Chhua, Kevin Zhu, Sean O'Brien
TL;DR
This work extends self-consistency for language models by weighting reasoning paths semantically, using embedding-based centroids and cosine-consensus measures together with explicit outlier filtering. By separating semantic analysis from the final majority vote, the approach improves robustness on complex reasoning tasks across arithmetic and commonsense benchmarks. Empirical results show that Semantic Consensus Weighting often outperforms both Centroid Proximity Weighting and baseline self-consistency; outlier detectors improve accuracy further, though their effectiveness varies across models and datasets. The framework provides a practical, scalable method to diagnose and improve reasoning quality in open-world tasks, while highlighting the importance of featurizer quality and hyperparameter tuning. Overall, semantic self-consistency offers a principled way to use the semantic information in reasoning traces to make large language models reason more reliably.
Abstract
While large language models (LLMs) have rapidly improved their performance on a broad range of tasks, they still often fall short on reasoning tasks. As LLMs become more integrated into diverse real-world tasks, advancing their reasoning capabilities is crucial to their effectiveness on nuanced, complex problems. Wang et al.'s self-consistency framework shows that sampling multiple rationales before taking a majority vote reliably improves model performance across various closed-answer reasoning tasks. Standard methods based on this framework aggregate the final decisions of these rationales but fail to use the semantic information contained in the step-by-step reasoning paths. Our work introduces semantic self-consistency, which enhances this approach by incorporating and analyzing both the reasoning paths of these rationales and their final decisions before taking a majority vote. These methods not only improve the reliability of reasoning paths but also yield more robust performance on complex reasoning tasks.
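
To make the contrast concrete, the sketch below compares plain self-consistency (a majority vote over final answers) with one possible semantically weighted vote. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy embeddings, and the choice of weighting each rationale by its mean cosine similarity to the other rationales are all hypothetical. In practice the embeddings would come from a sentence featurizer applied to each sampled reasoning path.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def majority_vote(answers):
    """Baseline self-consistency: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def consensus_weighted_vote(answers, embeddings):
    """Assumed weighting scheme: each rationale votes for its answer with a
    weight equal to its mean cosine similarity to all other rationales."""
    scores = defaultdict(float)
    n = len(answers)
    for i, (answer, emb) in enumerate(zip(answers, embeddings)):
        sims = [cosine(emb, embeddings[j]) for j in range(n) if j != i]
        weight = sum(sims) / len(sims) if sims else 1.0
        scores[answer] += weight
    return max(scores, key=scores.get)


# Toy data (hypothetical): five sampled rationales with their final answers.
answers = ["18", "18", "26", "26", "26"]
# Toy embeddings: the two "18" rationales agree semantically, while the
# three "26" rationales are mutually unrelated.
embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

baseline = majority_vote(answers)
weighted = consensus_weighted_vote(answers, embeddings)
```

On this deliberately extreme toy input the majority vote selects "26", whereas the weighted vote selects "18", because the two "18" rationales support each other semantically and the "26" rationales do not. Real reasoning paths are rarely this cleanly separated, so the two votes would usually agree far more often.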
