References Improve LLM Alignment in Non-Verifiable Domains

Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan

TL;DR

This work addresses LLM alignment in non-verifiable domains by introducing reference-guided LLM-evaluators as soft verifiers to enable post-training improvements without external supervision. It develops targeted prompting strategies (RefEval and RefMatch) that leverage reference outputs to significantly improve judge accuracy across multiple models and benchmarks. The authors demonstrate a two-stage self-improvement pipeline—SFT on high-quality references followed by DPO guided by reference-grounded judges—that yields substantial gains, rivaling finetuned reward models like ArmoRM. The findings highlight the practical potential of reference-based supervision for efficient LLM post-training in non-verifiable domains and point to future work on richer reference sources and domain-specific reward design.
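For orientation, the DPO stage mentioned above is usually written with the standard Direct Preference Optimization objective shown below. This is the textbook form, not a formula quoted from this paper, and the symbols are assumptions for illustration: $x$ is an instruction, $y_w$ and $y_l$ are the responses the judge preferred and rejected, $\beta$ is a temperature hyperparameter, and $\pi_{\text{init}}$ is the frozen starting policy (presumably the SFT model in this pipeline). DPO papers normally call this frozen policy the "reference policy"; it is renamed here because it is unrelated to the reference outputs shown to the judge.

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{init}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{init}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{init}}(y_l \mid x)}
      \right)
    \right]
```

In this reading, the reference-guided judge only decides which sampled response plays the role of $y_w$ and which plays $y_l$; the loss itself is unchanged.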

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
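The self-improvement step described above can be sketched as follows: sample several responses from the policy, let a judge that sees a reference output compare them pairwise, and keep the best and worst as a preference pair for DPO. This is a minimal sketch under stated assumptions, not the authors' implementation. The prompt template, the `PreferencePair` type, the round-robin win counting, and the `toy_overlap_judge` stand-in (word overlap with the reference instead of an LLM call) are all hypothetical and are not the paper's RefEval or RefMatch protocols.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List

# Hypothetical prompt template (assumption): NOT the paper's RefEval/RefMatch prompt.
JUDGE_TEMPLATE = (
    "Instruction:\n{instruction}\n\n"
    "Reference answer:\n{reference}\n\n"
    "Response A:\n{a}\n\n"
    "Response B:\n{b}\n\n"
    "Using the reference as a guide, which response is better? Answer 'A' or 'B'."
)


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


# A judge maps (instruction, reference, response_a, response_b) to "A" or "B".
Judge = Callable[[str, str, str, str], str]


def build_judge_prompt(instruction: str, reference: str, a: str, b: str) -> str:
    """Fill the template; a real judge would send this text to an LLM."""
    return JUDGE_TEMPLATE.format(instruction=instruction, reference=reference, a=a, b=b)


def toy_overlap_judge(instruction: str, reference: str, a: str, b: str) -> str:
    """Stand-in judge (assumption): prefers the response sharing more words
    with the reference. Ties go to response A."""
    ref_words = set(reference.lower().split())

    def score(text: str) -> int:
        return len(ref_words & set(text.lower().split()))

    return "A" if score(a) >= score(b) else "B"


def make_preference_pair(
    instruction: str, reference: str, candidates: List[str], judge: Judge
) -> PreferencePair:
    """Judge every pair of candidates, count wins, and return the candidate
    with the most wins as `chosen` and the one with the fewest as `rejected`."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        verdict = judge(instruction, reference, candidates[i], candidates[j])
        wins[i if verdict == "A" else j] += 1
    order = sorted(range(len(candidates)), key=lambda k: wins[k])
    return PreferencePair(instruction, candidates[order[-1]], candidates[order[0]])
```

To use an actual LLM judge, one would replace `toy_overlap_judge` with a function that sends `build_judge_prompt(...)` to a model and parses its answer. Judging each pair in both orders is a common way to reduce position bias, though it is omitted here for brevity.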

Paper Structure (40 sections, 1 equation, 24 figures, 20 tables)

This paper contains 40 sections, 1 equation, 24 figures, 20 tables.

Figures (24)

  • Figure 1: Overview of our study on reference-guided LLM-as-a-Judge for LLM alignment. Conceptual plots illustrating (I) the improvement in average accuracy from reference-guided evaluation (§\ref{subsec:evaluation_setup}) and (II) the reference-guided self-improvement (§\ref{sec:training_setup}).
  • Figure 2: A snapshot of the RefEval method.
  • Figure 3: Comparison of reference-free and reference-guided self-improvement across task categories on AlpacaEval and Arena-Hard.
  • Figure 4: Aggregate performance by dataset for Larger Models (>9B parameters, including GPT-4o variants; top panel) and Smaller Models (≤9B parameters; bottom panel). RefEval demonstrates consistent improvements across most datasets for both model groups.
  • Figure 5: Evaluation accuracy of 11 open-source LLM-judges using RefEval and RefMatch with single references from various frontier models, and their voted versions. Horizontal dashed lines indicate reference-free baselines. Results are averaged over five datasets.
  • ...and 19 more figures