Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank
Jiayu Liu, Wei Dai, Zhenya Huang, Ning Miao, Enhong Chen
TL;DR
The paper addresses the challenge of verifying LLM reasoning without external tools by introducing the correlation matrix rank as a proxy for correctness. It defines Self-Indicator, a plug-and-play method that generates multiple reasoning paths, computes two ranks per path using problem–solution templates, and weights paths to boost overall reasoning performance. The authors provide theoretical intuition and empirical validation across multiple backbones (e.g., LLaMA2-13B, LLaMA3-70B, GPT-3.5-Turbo) and math benchmarks (GSM8K, MATH, AIME24), showing that the approach distinguishes correct from incorrect reasoning with over 75% accuracy and improves benchmark accuracy by more than 8%. The method is model-agnostic, low-cost, and can complement existing verification strategies, making it practical for real-world deployment and potentially extendable to other NLP tasks.
Abstract
Despite the strong reasoning ability of large language models (LLMs), they are prone to errors and hallucinations. As a result, checking their outputs effectively and efficiently has become a critical problem in their applications. Existing checking methods rely heavily on external resources, such as trained verifiers (e.g., process/outcome reward models) or elaborate prompts, which incur high computational overhead and apply only to specific domains. In this paper, we investigate whether the internal behaviors of LLMs already indicate the credibility of their reasoning paths. Specifically, we find that the rank of the correlation matrix between the input problem and the output reasoning path is a robust indicator of reasoning correctness. Unlike other correctness indicators for LLMs, the correlation matrix is computed from the LLM itself, which avoids training a separate model or designing complicated prompts. Building on this finding, we design a simple, plug-and-play Self-Indicator method to reweight candidate reasoning paths, which achieves significant performance improvements over other voting and verification methods with very little computational overhead. Our experiments across multiple LLMs of varying scales and model families further demonstrate the effectiveness of Self-Indicator. It achieves over 75% accuracy in distinguishing correct reasoning paths from incorrect ones and, in turn, improves accuracy on three reasoning benchmarks by more than 8%.
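The general idea of rank-based reweighting can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that per-token representations of the problem and of each reasoning path are already available as arrays, that the correlation matrix is taken to be the cosine similarity between those representations, and that a single rank is computed per path. The function names `correlation_rank` and `weighted_vote` are hypothetical, and the mapping from rank to weight is left to the caller because the abstract does not specify it.

```python
from collections import defaultdict

import numpy as np


def correlation_rank(problem_states, path_states, tol=1e-6):
    """Numerical rank of a problem-vs-path correlation matrix.

    problem_states: (m, d) array, one row per problem token (assumed input).
    path_states:    (n, d) array, one row per reasoning-path token (assumed input).
    Cosine similarity is an assumption; the abstract does not define the matrix.
    """
    p = np.asarray(problem_states, dtype=float)
    r = np.asarray(path_states, dtype=float)
    # Normalize rows so each entry of the product is a cosine similarity.
    p = p / np.maximum(np.linalg.norm(p, axis=1, keepdims=True), 1e-12)
    r = r / np.maximum(np.linalg.norm(r, axis=1, keepdims=True), 1e-12)
    corr = p @ r.T  # shape (m, n)
    # Singular values below `tol` are treated as zero.
    return int(np.linalg.matrix_rank(corr, tol=tol))


def weighted_vote(answers, ranks, weight_fn):
    """Pick the answer with the largest total weight.

    answers:   final answer extracted from each candidate path.
    ranks:     rank computed for each candidate path.
    weight_fn: caller-supplied map from rank to weight (an assumption here).
    """
    scores = defaultdict(float)
    for answer, rank in zip(answers, ranks):
        scores[answer] += weight_fn(rank)
    return max(scores, key=scores.get)
```

With a constant `weight_fn`, `weighted_vote` reduces to plain majority voting, so the sketch makes explicit that the only change relative to self-consistency-style voting is the per-path weight. For example, `weighted_vote(["18", "18", "20"], [3, 3, 1], lambda k: 1.0 / k)` favors the path with the smaller rank; that inverse weighting is purely illustrative and not taken from the paper.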
