TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge

Cheng-Han Chiang, Hung-yi Lee, Michal Lukasik

TL;DR

TRACT tackles the mismatch between numeric score prediction and conventional cross-entropy fine-tuning in LLM-based evaluation by fusing chain-of-thought reasoning with regression-aware objectives in a two-stage training pipeline. Stage 1 seeds CoT reasoning alongside ground-truth scores; stage 2 retrains using self-generated CoTs to align training and inference distributions. The core CoT-RAFT objective jointly optimizes CoT generation and squared-error score prediction, with self-generated CoTs mitigating distribution drift. Across multiple datasets and base models, TRACT consistently outperforms strong baselines, with ablations confirming the necessity of CoT guidance, regression-aware losses, and self-generated CoTs for robust numeric evaluation.
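The two-stage flow described above can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: `fine_tune` and `sample_cot` are hypothetical callables standing in for a real training loop and a real decoding routine, and starting stage 2 again from the base model is an assumption (the summary only says stage 2 "retrains").

```python
from typing import Any, Callable, List, Tuple

# (input text, chain-of-thought, ground-truth score)
Example = Tuple[str, str, float]


def tract_two_stage(
    base_model: Any,
    inputs: List[str],
    annotation_cots: List[str],
    scores: List[float],
    fine_tune: Callable[[Any, List[Example]], Any],  # hypothetical trainer
    sample_cot: Callable[[Any, str], str],           # hypothetical sampler
):
    # Stage 1: train on annotation CoTs paired with ground-truth scores.
    stage1_data = list(zip(inputs, annotation_cots, scores))
    p_s = fine_tune(base_model, stage1_data)

    # Stage 2: the stage-1 model is frozen and only used to sample CoTs,
    # so the CoTs seen in training resemble those produced at inference.
    self_cots = [sample_cot(p_s, x) for x in inputs]
    stage2_data = list(zip(inputs, self_cots, scores))

    # Assumption: stage 2 restarts from the base model, not from p_s.
    p_tract = fine_tune(base_model, stage2_data)
    return p_s, p_tract, stage2_data
```

Both calls to `fine_tune` would use the same CoT-RAFT objective; only the source of the CoT supervision changes between stages.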

Abstract

The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where a numerical assessment is assigned by an LLM to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, a seed LLM is fine-tuned to generate CoTs, which then serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss for learning the CoT reasoning capabilities and the regression-aware loss for the score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
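To make the combined objective concrete, here is a minimal per-example sketch in plain Python. It rests on several assumptions that are not stated in the text above: the regression-aware prediction is taken to be the expected score under the model's distribution over candidate score tokens, the CE term is taken to cover both the CoT tokens and the ground-truth score token, and $\lambda$ is taken to weight the squared-error term. The function names and the 1 to 5 score range in the usage note are illustrative only.

```python
import math
from typing import Dict, List


def raft_prediction(score_probs: Dict[int, float]) -> float:
    """Expected score under the model's distribution over candidate scores."""
    return sum(score * prob for score, prob in score_probs.items())


def cot_raft_loss(
    cot_token_logprobs: List[float],  # log-probs of the CoT tokens
    score_probs: Dict[int, float],    # P(score | input, CoT)
    target: int,                      # ground-truth score
    lam: float = 1.0,                 # assumed weight on the regression term
) -> float:
    # CE part: negative log-likelihood of the CoT tokens plus the
    # ground-truth score token (assumed to be included in the CE term).
    ce = -sum(cot_token_logprobs) - math.log(score_probs[target])
    # Regression-aware part: squared error of the expected score.
    sq_err = (raft_prediction(score_probs) - target) ** 2
    return ce + lam * sq_err
```

For example, with `score_probs = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}` the expected score is 3, so a target of 3 contributes no squared-error penalty while a target of 5 contributes `lam * 4`. Plain CE would treat a wrong prediction of 4 and a wrong prediction of 1 as the same kind of miss, whereas the squared-error term penalizes the more distant one more heavily.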

Paper Structure

This paper contains 33 sections, 4 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: TRACT method overview. (a) Illustration of the CoT-RAFT fine-tuning objective (the full CoT-RAFT loss), used in both stages of TRACT. (b) Two fine-tuning stages of TRACT (also see the TRACT algorithm). Stage 1: model $p_{\rm s}$ is trained over the ground truth scores and the annotation CoTs (generated by the annotation model $p_{\rm a}$). Stage 2: CoT supervision is sampled from $p_{\rm s}$ (frozen at this stage) and used to fine-tune the final model $p_{\rm tract}$.
  • Figure 2: Performance of TRACT across varying values of $\lambda$ in the full CoT-RAFT loss. Results from the Mistral model. For a wide range of $\lambda$ values we find TRACT outperforming the baselines.
  • Figure 3: Average Pearson's $r$ as a function of the number of sampled CoTs. Results from fine-tuning the Mistral model. Shaded regions correspond to the standard deviations across multiple inference runs with varying random seeds. Note that Prometheus is trained on significantly more data than the other two methods in this figure. Despite that, under a limited inference budget, TRACT outperforms Prometheus.
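
The quantity plotted in Figure 3 can be illustrated with a small sketch: combine the per-CoT score predictions for each input, then correlate the combined predictions with the reference scores. Averaging is an assumption here, since the caption does not say how predictions from several sampled CoTs are combined, and both helper names are hypothetical.

```python
import math
from statistics import mean
from typing import List, Sequence


def aggregate_over_cots(per_cot_predictions: Sequence[float]) -> float:
    """Combine the score predictions from several sampled CoTs.

    Assumption: a simple average; other aggregation rules are possible.
    """
    return mean(per_cot_predictions)


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between predicted and reference scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In use, each input would contribute one aggregated prediction, for example `aggregate_over_cots([2.5, 3.5])`, and `pearson_r` would then be computed over the whole evaluation set against the human scores. Sampling more CoTs raises inference cost, which is the budget axis the caption refers to.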