Preference-grounded Token-level Guidance for Language Model Fine-tuning

Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, Mingyuan Zhou

TL;DR

This paper develops an alternating training process that iterates between grounding the sequence-level preference into token-level training guidance and improving the LM with the learned guidance, addressing the granularity mismatch between sequence-level preferences and token-level LM training losses.

Abstract

Aligning language models (LMs) with preferences is an important problem in natural language generation. A key challenge is that preferences are typically provided at the sequence level while LM training and generation both occur at the token level. There is, therefore, a granularity mismatch between the preference and the LM training losses, which may complicate the learning problem. In this paper, we address this issue by developing an alternate training process, where we iterate between grounding the sequence-level preference into token-level training guidance, and improving the LM with the learned guidance. For guidance learning, we design a framework that extends the pairwise-preference learning in imitation learning to both variable-length LM generation and the utilization of the preference among multiple generations. For LM training, based on the amount of supervised data, we present two minimalist learning objectives that utilize the learned guidance. In experiments, our method performs competitively on two distinct representative LM tasks -- discrete-prompt generation and text summarization.
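
The guidance-learning step described above can be illustrated with a minimal sketch. It assumes average aggregation (the "AVG" example of the aggregation function $f(\cdot)$) and a Plackett-Luce-style listwise loss over multiple ranked generations; the loss form, the function names `aggregate_average` and `listwise_preference_loss`, and the input layout are illustrative assumptions, not the paper's exact objective.

```python
import math
from typing import List, Sequence


def aggregate_average(token_rewards: Sequence[float]) -> float:
    """Average per-token rewards into one sequence-level score.

    Averaging is one example of an aggregation function f(.); it keeps
    scores comparable across generations of different lengths.
    """
    if not token_rewards:
        raise ValueError("need at least one token reward")
    return sum(token_rewards) / len(token_rewards)


def listwise_preference_loss(ranked_token_rewards: List[Sequence[float]]) -> float:
    """Negative log-likelihood of a preference ordering (assumed loss form).

    ranked_token_rewards[0] holds the token rewards of the most preferred
    generation, ranked_token_rewards[-1] those of the least preferred one.
    With two generations this reduces to the Bradley-Terry pairwise loss.
    """
    scores = [aggregate_average(r) for r in ranked_token_rewards]
    loss = 0.0
    for i in range(len(scores) - 1):
        tail = scores[i:]
        # Numerically stable log-sum-exp over the not-yet-ranked generations.
        m = max(tail)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in tail))
        loss -= scores[i] - log_denominator
    return loss


# Two generations of different lengths; the first is the preferred one.
agreeing = listwise_preference_loss([[2.0, 2.0, 2.0], [0.0, 0.0]])
disagreeing = listwise_preference_loss([[0.0, 0.0], [2.0, 2.0, 2.0]])
```

In this sketch the loss is small when the aggregated token rewards agree with the given ordering (`agreeing`) and large when they contradict it (`disagreeing`), so minimizing it with respect to a parametrized token-level reward pushes that reward to explain the sequence-level preference.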

Paper Structure (28 sections, 15 equations, 6 figures, 13 tables, 2 algorithms)

Figures (6)

  • Figure 1: Overview of the proposed framework. "AVG" denotes average, which is an example of the aggregation function $f(\cdot)$ discussed in Section \ref{sec:method_agg_func}. "Seq Eval" refers to the parametrized sequence-level evaluations. The model choice of the reward function and LM depends on the specific task and is discussed in Section \ref{sec:exp}.
  • Figure 2: CNN/DM summarization of our method and baselines under T5-base LM. "Sup" denotes "Supervised". "Ref" denotes the ground-truth reference summary. Except for the human study in (f), baseline results are directly cited from RL4LMs \cite{nlpo2022} and are the per-metric best across their three environmental rewards.
  • Figure 3: Performance of our method using sequence-level and token-level preference-based guidance. "Best Baseline" refers to the best result in the baseline discrete-prompt methods for the prompt task, and the best result over all baseline methods for the summarization task. Error bars show one standard deviation.
  • Figure 4: Performance of our method with and without the reward-function retraining scheme. "Best Baseline" has the same meaning as in Fig. \ref{fig:seq_rew}. Error bars show one standard deviation.
  • Figure 5: Varying the number of sequences to learn the token-level guidance, showing mean over random seeds.
  • ...and 1 more figure