Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

Hai Ye, Hwee Tou Ng

TL;DR

Self-J is a self-training framework for developing judge models without human-annotated quality scores. It leverages the model's inherent self-evaluation capability to extract information about response quality from labeled instruction-tuning data.

Abstract

Pre-trained large language models (LLMs) can be tailored to adhere to human instructions through instruction tuning. However, due to shifts in the distribution of test-time data, they may not always execute instructions accurately, potentially generating factual errors or misaligned content when acting as chat assistants. To enhance the reliability of LLMs in following instructions, we propose the study of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low. We train judge models that can predict numerical quality scores for model responses. To address data scarcity, we introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores. Our method leverages the model's inherent self-evaluation capability to extract information about response quality from labeled instruction-tuning data. It incorporates a gold reference answer to facilitate self-evaluation and recalibrates the result by assessing the semantic similarity between the response sample and the gold reference. During the training phase, we implement self-distillation as a regularization technique to enhance the capability of reference-free estimation. To validate alignment evaluation on general instruction-following tasks, we collect large-scale high-quality instructions from Hugging Face for model training and evaluation. Extensive experiments on five open-source models show that our method correlates much more strongly with GPT-4 than strong baselines do, e.g., supervised models distilled from GPT-4 and GPT-3.5-turbo. Our analysis shows our model's strong generalization across domains. Additionally, our judge models serve as good reward models, e.g., boosting WizardLM-13B-V1.2 from 89.17 to 92.48 and from 12.03 to 15.90 on versions v1 and v2 of AlpacaEval, respectively, using best-of-32 sampling with our judge models.
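The two uses of a judge model described in the abstract, declining to answer when the predicted quality is low and picking the best of several sampled responses, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Judge` interface, the function names, the threshold value, and the word-count `toy_judge` are all hypothetical stand-ins for a trained judge model.

```python
from typing import Callable, List, Optional

# Hypothetical judge interface (an assumption for this sketch):
# maps (instruction, response) to a numerical quality score.
Judge = Callable[[str, str], float]


def selective_answer(
    instruction: str, response: str, judge: Judge, threshold: float
) -> Optional[str]:
    """Return the response if its predicted quality reaches the threshold.

    Returns None (the system declines to answer) otherwise.
    """
    score = judge(instruction, response)
    return response if score >= threshold else None


def best_of_n(instruction: str, candidates: List[str], judge: Judge) -> str:
    """Use the judge as a reward model: keep the highest-scoring candidate."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda r: judge(instruction, r))


def toy_judge(instruction: str, response: str) -> float:
    """Stand-in judge for demonstration only: scores by word count, capped at 10.

    A real judge would be a trained model predicting a quality score.
    """
    return float(min(10, len(response.split())))


if __name__ == "__main__":
    question = "What is the capital of France?"
    samples = ["Paris.", "The capital of France is Paris."]
    best = best_of_n(question, samples, toy_judge)
    print(best)
    print(selective_answer(question, "Paris.", toy_judge, threshold=5.0))
```

With best-of-32 sampling as reported in the abstract, `candidates` would hold 32 sampled responses. The abstention threshold is a free parameter that trades coverage against answer quality.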

Paper Structure

This paper contains 28 sections, 7 equations, 15 figures, 6 tables, 1 algorithm.

Figures (15)

  • Figure 1: Selective instruction following with alignment evaluation. We train a judge model to rate an LLM's response with a numerical score.
  • Figure 2: The prompt used for alignment evaluation. It is used by GPT-4 and the other open-source models studied in this work. We use the reference-based version of the evaluation, in which a reference answer is required. The prompt is adapted from Zheng et al. (2023) and arXiv:2305.14314. As indicated by Zheng et al. (2023), reference answers can improve the performance of GPT-4's evaluations on reasoning tasks, such as coding problems and mathematical questions.
  • Figure 3: Illustration of Self-J (see also the pseudocode of Algorithm 1). (a) We first conduct instruction tuning on a pre-trained LLM (or directly use an existing instruction-tuned model, e.g., Vicuna). (b) We generate quality scores with model self-evaluation, recalibrated by a semantic similarity score. (c) With the generated quality scores, we train a judge model through self-distillation.
  • Figure 4: Diversity of our instruction set (with 30k random samples for judge model training), showing root verbs in the inner circle and their first nouns in the outer circle.
  • Figure 5: The proportions of different categories in our whole instruction set (about 5.7 million instructions). Common topics have the highest number of questions, followed by coding questions, with academic questions being the least frequent.
  • ...and 10 more figures
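Step (b) of Figure 3, recalibrating a self-evaluation score with a semantic similarity score, can be illustrated with a small sketch. The combination rule is not given in this summary, so multiplying the two scores is an assumption made purely for illustration. The bag-of-words cosine similarity is likewise a simple stand-in for whatever similarity measure the paper uses, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts, in [0, 1].

    A stand-in for a learned semantic similarity model.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recalibrated_score(self_eval_score: float, response: str, reference: str) -> float:
    """ASSUMPTION: scale the self-evaluation score by response-reference similarity.

    The actual recalibration rule may differ; this only shows the idea that a
    response far from the gold reference has its score pulled down.
    """
    return self_eval_score * cosine_similarity(response, reference)


if __name__ == "__main__":
    reference = "the capital of france is paris"
    print(recalibrated_score(8.0, "the capital of france is paris", reference))
    print(recalibrated_score(8.0, "berlin", reference))
```

Under this assumed rule, a response identical to the reference keeps its self-evaluation score, while a response sharing no words with the reference is scored zero.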