
Aligning Large Language Models by On-Policy Self-Judgment

Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu

TL;DR

This paper presents a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient, as it does not require an additional RM for evaluating the samples for on-policy learning.

Abstract

Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning. In this paper, we present a novel alignment framework, SELF-JUDGE, that (1) does on-policy learning and (2) is parameter efficient, as it does not require an additional RM for evaluating the samples for on-policy learning. To this end, we propose Judge-augmented Supervised Fine-Tuning (JSFT) to train a single model to act as both a policy and a judge. Specifically, we view the pairwise judgment task, choosing the better response from a response pair, as a special case of the instruction-following task. The resulting model can judge preferences over on-the-fly responses from the current policy, which is initialized from the model itself. Experimental results show the efficacy of SELF-JUDGE, outperforming baselines in preference benchmarks. We also show that rejection sampling by the model itself can improve performance further without an additional evaluator.
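To make the idea of treating pairwise judgment as instruction following concrete, here is a minimal sketch of how a preference pair might be turned into an ordinary SFT example with a single-token target. The template wording, the function names `build_judgment_example` and `augment_sft`, and the random ordering of the two responses are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical judgment template; the paper's actual wording is not reproduced here.
JUDGE_TEMPLATE = (
    "Which response is better for the prompt below?\n"
    "Prompt: {prompt}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    "Answer with a single token, A or B."
)


def build_judgment_example(prompt, chosen, rejected, rng):
    """Turn one preference pair into an instruction-following example.

    The position of the preferred response is randomized (an assumption of
    this sketch) so the target token is not always the same letter.
    """
    chosen_first = rng.random() < 0.5
    a, b = (chosen, rejected) if chosen_first else (rejected, chosen)
    return {
        "instruction": JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b),
        "target": "A" if chosen_first else "B",  # single judge token
    }


def augment_sft(sft_examples, preference_pairs, seed=0):
    """Append judgment examples to a regular SFT dataset."""
    rng = random.Random(seed)
    judged = [
        build_judgment_example(p, c, r, rng) for p, c, r in preference_pairs
    ]
    return list(sft_examples) + judged
```

Training on the combined list with a standard next-token objective would expose one model to both the response-generation and the judgment task, which is the property the abstract attributes to JSFT.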

Paper Structure (42 sections, 2 equations, 5 figures, 20 tables)


Figures (5)

  • Figure 1: In our framework, Self-Judge, a single model is trained not only to generate responses but also to perform a judgment task, where it selects the better of two responses through a single token prediction. This enables on-policy self-training: the model judges responses sampled from the current policy in order to improve itself.
  • Figure 2: An overview of Self-Judge. 1) We train an LLM to act as a Judge Model (JM), which can both generate responses and compare response pairs. We train the JM on an SFT dataset augmented with the pairwise judgment task, where the better response is selected with a single token. 2) We initialize a policy and a fixed reference model from the trained JM. The policy model then samples response pairs, and the reference model judges the pairs to provide feedback in the form of preference orders. 3) At inference time, we perform rejection sampling through a tournament over responses from the policy, using the model's own judgments, for further improvements.
  • Figure 3: An example of a judgment template $\mathcal{C}$. The judgment template asks which of the two responses is better for a given prompt and asks the model to select the judge token $\mathcal{J} \in \{\mathcal{A}, \mathcal{B}\}$ corresponding to the better response. Optionally, a principle for the judgment can be added to the judgment template, and the rationale can be included in the target sequence for training.
  • Figure 4: Winning rate, average response length, and 4-gram repetition on AlpacaEval as a function of the number of samples ($N$) used for self-rejection with JM and JM-PR after self-training. Even though LLM-as-a-judge tends to favor verbose responses (Zheng et al., 2023), JM-PR reliably improves the winning rate as $N$ increases, with smoother increases in response length and lower repetition compared to JM.
  • Figure 5: Result of iterative self-training on AlpacaEval using JM-PR. Performance as a policy increases as iterations proceed without losing the capacity as a judge for applying self-rejection.
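
The tournament-style self-rejection mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration assuming a single-elimination bracket and a caller-supplied `judge(prompt, a, b)` function that returns the judge token `"A"` or `"B"`; the bracket format, the function names, and the length-based toy judge are assumptions made for the example, not the paper's implementation.

```python
def tournament_select(prompt, responses, judge):
    """Pick one response via a single-elimination tournament.

    `judge(prompt, a, b)` is assumed to return "A" if `a` is better and
    "B" otherwise. With N responses this makes N - 1 pairwise judgments.
    """
    if not responses:
        raise ValueError("need at least one response")
    pool = list(responses)
    while len(pool) > 1:
        next_round = []
        # Compare adjacent pairs; the winner of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if judge(prompt, a, b) == "A" else b)
        # An unpaired last response advances without a comparison.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def toy_length_judge(prompt, a, b):
    """Stand-in judge for demonstration only: prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"
```

In an actual setup, `judge` would wrap a forward pass of the trained model on the filled-in judgment template and read off the predicted judge token; the toy judge here exists only so the sketch runs without a model.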