Self-Training Meets Consistency: Improving LLMs' Reasoning with Consistency-Driven Rationale Evaluation

Jaehyeok Lee, Keisuke Sakaguchi, JinYeong Bak

TL;DR

CREST introduces a consistency-driven approach to evaluate self-generated rationales for LLM reasoning by using follow-up questions to gauge robustness. It combines rationale filtering and mixed-preference learning via Direct Preference Optimization to train more reliable reasoning processes. Across ReClor, ARC, and CSQA, CREST achieves higher accuracy and yields more robust, logically sound rationales than prior self-training methods. By measuring consistency across related questions, CREST mitigates learning from flawed reasoning paths and improves practical reasoning performance.

Abstract

Self-training approaches for large language models (LLMs) improve reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, relying on this single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.
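
The first method, rationale filtering, can be illustrated with a minimal Python sketch. It assumes that the reward on the original question is binary and that the follow-up reward counts correct follow-up predictions. The `Rationale` class, the function names, and the `max_wrong` threshold are hypothetical and chosen only for illustration; the paper's exact filtering rule may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rationale:
    text: str
    prediction: str                  # answer predicted for the original question
    followup_predictions: List[str]  # answers predicted for each follow-up question


def reward_z(r: Rationale, answer: str) -> int:
    """Return 1 if the rationale leads to the correct original answer, else 0."""
    return int(r.prediction == answer)


def reward_z_tilde(r: Rationale, followup_answers: List[str]) -> int:
    """Count the follow-up questions answered correctly with this rationale."""
    return sum(p == a for p, a in zip(r.followup_predictions, followup_answers))


def filter_rationales(rationales: List[Rationale], answer: str,
                      followup_answers: List[str],
                      max_wrong: int = 1) -> List[Rationale]:
    """Keep rationales that are correct on the original question and miss
    at most `max_wrong` follow-up questions (an assumed threshold)."""
    kept = []
    for r in rationales:
        wrong = len(followup_answers) - reward_z_tilde(r, followup_answers)
        if reward_z(r, answer) == 1 and wrong <= max_wrong:
            kept.append(r)
    return kept
```

Under these assumptions, a rationale that reaches the right answer but fails most follow-up questions is dropped from the fine-tuning data, which is the behavior the abstract describes.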

Paper Structure

This paper contains 44 sections, 4 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: An example of rationale generation and evaluation in CREST: An LLM generates two rationales ($r^1$, $r^2$) and answer predictions to solve question Q. Even though $r^2$ lacks focus and clear support for the answer, previous approaches evaluate both $r^1$ and $r^2$ as equally right. Through a more fine-grained evaluation using follow-up questions, we can identify the better rationale, $r^1$, which leads to more consistent predictions across all questions.
  • Figure 2: Overview of CREST. In Rationale Generation (1), given a question $q_i$ and an answer $a_i$, an initial LLM $\textbf{M}$ generates $N$ rationales and answer predictions $(r_i, p_i)$ to solve $q_i$, and then solves follow-up questions $\tilde{q}_{i,f}$ using each rationale $r^n_i$, resulting in $\tilde{p}^n_{i,f}$. Next, in Rationale Evaluation (2), we assign rewards $z$ and $\tilde{z}$ to each rationale based on the correctness of the predictions, as defined in the paper's equations for $z$ and $\tilde{z}$. In Supervised Fine-Tuning (3), we train $\textbf{M}$ on the rationales filtered by $z$ and $\tilde{z}$ with a tolerance term $t$, resulting in $\textbf{M}_\textbf{SFT}$. Finally, in Preference Learning (4), we build preference pairs based on $z$ and $\tilde{z}$, and train $\textbf{M}_\textbf{SFT}$ on them, resulting in $\textbf{M}_\textbf{CREST}$.
  • Figure 3: Distribution of rationale proportions based on $\tilde{z}$ for rationales with $z=1$ and $z=0$, respectively. For example, among the generated rationales with $z=0$ for CSQA, approximately 60% have $\tilde{z}=3$. Rationales with $z=0$ are relatively concentrated at lower $\tilde{z}$ values compared to those with $z=1$. This correlation between $z$ and $\tilde{z}$ suggests that $\tilde{z}$ reflects the quality of the rationale.
  • Figure 4: Proportion of rationale data used for training $\textbf{M}_\textbf{SFT}$ and task performance on three datasets, according to tolerance $t$. The results suggest that a moderate tolerance $t$ improves performance, while overly high $t$ values can degrade it, indicating the importance of excluding less robust rationales from training.
  • Figure 5: Task performance based on $\lambda$ between $P_{z}$ and $P_{\tilde{z}}$ in preference learning on ReClor. As $\lambda$ increases, the model learns more from $P_{\tilde{z}}$ than from $P_{z}$, which leads to improved performance on the Hard set, while performance on the Easy set tends to decrease. Overall performance peaks at $\lambda=0.6$, where the trade-off between the two sets is balanced. These results suggest that preference learning on $P_{\tilde{z}}$ helps reduce the model's reliance on biases in the Easy set, enhancing the robustness of its reasoning ability.
  • ...and 8 more figures
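
The preference-learning step described in the Figure 2 and Figure 5 captions can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the captions do not say how pairs are formed or whether $\lambda$ weights losses or sampling. Here $P_{z}$ pairs a rationale with $z=1$ against one with $z=0$, $P_{\tilde{z}}$ pairs rationales that share the same $z$ but differ in $\tilde{z}$, and $\lambda$ is treated as the fraction of training pairs drawn from $P_{\tilde{z}}$. All function names are hypothetical.

```python
import random
from itertools import permutations
from typing import List, Tuple

Scored = Tuple[str, int, int]  # (rationale text, z, z_tilde)
Pair = Tuple[str, str]         # (chosen rationale, rejected rationale)


def pairs_by_z(rs: List[Scored]) -> List[Pair]:
    """P_z (assumed form): prefer a rationale with higher z."""
    return [(a[0], b[0]) for a, b in permutations(rs, 2) if a[1] > b[1]]


def pairs_by_z_tilde(rs: List[Scored]) -> List[Pair]:
    """P_z_tilde (assumed form): among rationales with equal z,
    prefer the one with the higher follow-up reward."""
    return [(a[0], b[0]) for a, b in permutations(rs, 2)
            if a[1] == b[1] and a[2] > b[2]]


def mix_pairs(p_z: List[Pair], p_zt: List[Pair], lam: float,
              n: int, seed: int = 0) -> List[Pair]:
    """Draw n pairs with replacement, a fraction `lam` of them from P_z_tilde.
    Treating lambda as a sampling ratio is an assumption of this sketch."""
    rng = random.Random(seed)
    n_zt = round(lam * n)
    n_z = n - n_zt
    return rng.choices(p_z, k=n_z) + rng.choices(p_zt, k=n_zt)
```

The mixed pairs would then be passed to a preference-optimization trainer such as DPO. With $\lambda=0$ the model would see only correctness-based pairs, and with $\lambda=1$ only consistency-based pairs.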