Self-Training Meets Consistency: Improving LLMs' Reasoning with Consistency-Driven Rationale Evaluation
Jaehyeok Lee, Keisuke Sakaguchi, JinYeong Bak
TL;DR
CREST introduces a consistency-driven approach to evaluate self-generated rationales for LLM reasoning by using follow-up questions to gauge robustness. It combines rationale filtering and mixed-preference learning via Direct Preference Optimization to train more reliable reasoning processes. Across ReClor, ARC, and CSQA, CREST achieves higher accuracy and yields more robust, logically sound rationales than prior self-training methods. By measuring consistency across related questions, CREST mitigates learning from flawed reasoning paths and improves practical reasoning performance.
Abstract
Self-training approaches for large language models (LLMs) improve reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, relying on this single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.
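The two methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Rationale` record, the `min_followup_correct` threshold, and the pairing rule in `preference_pairs` are all hypothetical choices made here for illustration, since the abstract does not specify them. The sketch assumes each rationale has already been scored on the original question and on a set of follow-up questions.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Rationale:
    # Hypothetical record; field names are assumptions, not from the paper.
    text: str
    original_correct: bool        # correct answer on the original question?
    followup_correct: List[bool]  # one flag per follow-up question

    @property
    def followup_score(self) -> int:
        # Number of follow-up questions answered correctly.
        return sum(self.followup_correct)


def filter_rationales(rationales: List[Rationale],
                      min_followup_correct: int) -> List[Rationale]:
    """Keep rationales that are correct on the original question and
    reach an assumed minimum number of correct follow-up answers."""
    return [
        r for r in rationales
        if r.original_correct and r.followup_score >= min_followup_correct
    ]


def preference_pairs(rationales: List[Rationale]) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs from both signals.

    Assumed rule: original-question correctness decides first; if it
    ties, the higher follow-up score is preferred; full ties give no pair.
    """
    pairs = []
    for a, b in combinations(rationales, 2):
        if a.original_correct != b.original_correct:
            chosen, rejected = (a, b) if a.original_correct else (b, a)
        elif a.followup_score != b.followup_score:
            chosen, rejected = (a, b) if a.followup_score > b.followup_score else (b, a)
        else:
            continue
        pairs.append((chosen.text, rejected.text))
    return pairs
```

Pairs in this (chosen, rejected) form are the kind of input that a preference-learning method such as Direct Preference Optimization consumes; how CREST actually weights or mixes the two preference sources is not stated in the abstract and is left open here.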
