Assessing LLM Reasoning Steps via Principal Knowledge Grounding
Hyeon Hwang, Yewon Cho, Chanwoong Yoon, Yein Park, Minju Song, Kyungjae Lee, Gangwoo Kim, Jaewoo Kang
TL;DR
This work addresses how to verify that an LLM's intermediate reasoning is grounded in essential knowledge. It introduces a Principal Knowledge (PK) Collection of 112k PK units derived from the MMLU benchmark, and defines three knowledge-grounded metrics, Knowledge Precision (KP), Knowledge Recall (KR), and F1, to quantify how well reasoning recalls and applies the requisite knowledge. An open-weight evaluator is distilled from a strong teacher model to make assessment of PK usage in reasoning cost-effective. Integrating these metrics into Direct Preference Optimization and reasoning-focused variants allows reasoning depth to be controlled, improves end-task performance, and reduces token consumption through knowledge-aware guidance. Overall, the framework provides interpretable diagnostics for knowledge grounding and practical mechanisms for steering LLM reasoning toward concise, correct, and knowledge-grounded solutions.
Abstract
Step-by-step reasoning has become a standard approach for large language models (LLMs) to tackle complex tasks. While this paradigm has proven effective, it raises a fundamental question: How can we verify that an LLM's reasoning is accurately grounded in knowledge? To address this question, we introduce a novel evaluation suite that systematically assesses the knowledge grounding of intermediate reasoning. Our framework comprises three key components: (1) the Principal Knowledge Collection, a large-scale repository of atomic knowledge essential for reasoning; (2) knowledge-grounded evaluation metrics, built on the collection, that measure how well models recall and apply prerequisite knowledge in reasoning; and (3) an evaluator LLM, a lightweight model optimized for cost-effective and reliable computation of these metrics. Our evaluation suite is effective at identifying missing or misapplied knowledge elements, providing insight into fundamental reasoning deficiencies in LLMs. Beyond evaluation, we demonstrate how these metrics can be integrated into preference optimization, showcasing further applications of knowledge-grounded evaluation.
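To make the knowledge-grounded metrics concrete, the following is a minimal sketch of one way precision, recall, and F1 over principal knowledge units could be computed. The summary above does not give the exact formulas, so the definitions here are assumptions: KP is taken as the share of PK units used in the reasoning that are applied correctly, KR as the share of required PK units that are applied correctly, and F1 as their harmonic mean. The `PKJudgment` record and the `knowledge_scores` function are hypothetical names standing in for the evaluator LLM's per-unit verdicts.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class PKJudgment:
    """Hypothetical evaluator verdict for one required principal knowledge unit."""
    pk_id: str
    used: bool      # the reasoning invokes this PK unit
    correct: bool   # the unit is applied correctly (only counted if used)


def knowledge_scores(judgments: Iterable[PKJudgment]) -> Tuple[float, float, float]:
    """Return (KP, KR, F1) under the assumed definitions described above."""
    judgments = list(judgments)
    required = len(judgments)
    used = sum(j.used for j in judgments)
    correct = sum(j.used and j.correct for j in judgments)

    # Assumed KP: correctly applied units / units the reasoning used.
    kp = correct / used if used else 0.0
    # Assumed KR: correctly applied units / units the question requires.
    kr = correct / required if required else 0.0
    # F1: harmonic mean of KP and KR, defined as 0 when both are 0.
    f1 = 2 * kp * kr / (kp + kr) if (kp + kr) else 0.0
    return kp, kr, f1


# Illustrative verdicts: four required units, three used, two applied correctly.
example = [
    PKJudgment("pk-1", used=True, correct=True),
    PKJudgment("pk-2", used=True, correct=True),
    PKJudgment("pk-3", used=True, correct=False),
    PKJudgment("pk-4", used=False, correct=False),
]
kp, kr, f1 = knowledge_scores(example)
print(f"KP={kp:.3f} KR={kr:.3f} F1={f1:.3f}")
```

In this toy case a misapplied unit lowers both scores, while an omitted unit lowers only recall, which is the kind of distinction between missing and misapplied knowledge that the abstract describes.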
