TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge

Cheng-Han Chiang; Hung-yi Lee; Michal Lukasik

TL;DR

TRACT tackles the mismatch between numeric score prediction and conventional cross-entropy fine-tuning in LLM-based evaluation by fusing chain-of-thought reasoning with regression-aware objectives in a two-stage training pipeline. Stage 1 seeds CoT reasoning alongside ground-truth scores; stage 2 retrains using self-generated CoTs to align training and inference distributions. The core CoT-RAFT objective jointly optimizes CoT generation and squared-error score prediction, with self-generated CoTs mitigating distribution drift. Across multiple datasets and base models, TRACT consistently outperforms strong baselines, with ablations confirming the necessity of CoT guidance, regression-aware losses, and self-generated CoTs for robust numeric evaluation.

Abstract

The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where an LLM assigns a numerical assessment to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses the numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, a seed LLM is fine-tuned to generate CoTs, which then serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss for learning the CoT reasoning capabilities and the regression-aware loss for the score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
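
To make the combined objective concrete, the following is a minimal sketch of how a CE term over CoT tokens and a squared-error term over the predicted score might be added together. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `ce_weight` parameter, the 1-to-5 score range, and the choice to read the numeric prediction as the probability-weighted mean over candidate scores are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(score_logits, scores):
    """Probability-weighted mean over candidate scores.

    Assumption: the numeric prediction is the expectation of the score
    under the model's distribution over candidate score tokens.
    """
    probs = softmax(score_logits)
    return sum(p * s for p, s in zip(probs, scores))

def cot_cross_entropy(token_logits, target_ids):
    """Mean negative log-likelihood of the target CoT tokens."""
    nll = 0.0
    for logits, target in zip(token_logits, target_ids):
        nll -= math.log(softmax(logits)[target])
    return nll / len(target_ids)

def combined_loss(token_logits, target_ids, score_logits, scores, gold,
                  ce_weight=1.0):
    """CE on the CoT plus squared error on the expected score.

    `ce_weight` is a hypothetical balancing knob for this sketch.
    """
    ce = cot_cross_entropy(token_logits, target_ids)
    squared_error = (expected_score(score_logits, scores) - gold) ** 2
    return ce_weight * ce + squared_error

if __name__ == "__main__":
    scores = [1, 2, 3, 4, 5]                # assumed rubric range
    score_logits = [0.0, 0.0, 0.0, 0.0, 0.0]  # uniform over scores
    cot_logits = [[0.0, 0.0, 0.0, 0.0]]       # one CoT token, vocab of 4
    print(combined_loss(cot_logits, [2], score_logits, scores, gold=5.0))
```

Unlike a pure CE loss on the score token, the squared-error term penalizes a prediction of 1 more heavily than a prediction of 4 when the reference score is 5, which is the numeric sensitivity the abstract says CE fine-tuning neglects.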
