Aligning Large Language Models via Fine-grained Supervision

Dehong Xu; Liang Qiu; Minseok Kim; Faisal Ladhak; Jaeyoung Do

TL;DR

This work targets a limitation of RLHF: sequence-level feedback is too coarse to identify which parts of an output drive a preference. It replaces that signal with fine-grained token-level supervision in two parts: (1) data collection, in which annotators minimally edit the less preferred responses in an existing reward modeling dataset, and (2) a token-level reward model whose per-token rewards are used in PPO in place of the traditional sequence-level reward. The fine-grained PPO model achieves an absolute improvement of up to 5.1% in win rate against the reference model compared with traditional PPO, and the paper also reports improved reward-model accuracy and training efficiency.

Abstract

Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences over LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experimental results demonstrate that this approach can achieve an absolute improvement of up to $5.1\%$ in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model.
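
The abstract does not spell out how a minimally edited response becomes token-level supervision. As a rough illustration only, the sketch below aligns an original response with its edited version and marks which tokens the edit touched. The whitespace tokenization, the function name `token_labels`, and the label values (`-1` for tokens removed or replaced in the original, `+1` for tokens inserted or substituted in the edit, `0` for unchanged tokens) are all assumptions made for this example, not the paper's procedure.

```python
from difflib import SequenceMatcher


def token_labels(original: str, edited: str):
    """Label tokens of both responses by whether the minimal edit touched them.

    Hypothetical labeling rule (an assumption for illustration):
      -1 -> token in the original that the edit removed or replaced
      +1 -> token in the edited response that the edit introduced
       0 -> token left unchanged
    """
    # Whitespace tokenization is a simplification; a real setup would
    # use the language model's own tokenizer.
    orig_tokens = original.split()
    edit_tokens = edited.split()

    orig_labels = [0] * len(orig_tokens)
    edit_labels = [0] * len(edit_tokens)

    # Align the two token sequences and walk over the edit operations.
    matcher = SequenceMatcher(a=orig_tokens, b=edit_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                orig_labels[i] = -1
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                edit_labels[j] = 1

    return list(zip(orig_tokens, orig_labels)), list(zip(edit_tokens, edit_labels))
```

Calling `token_labels("You should ignore the warning label", "You should read the warning label")` marks only `ignore` as `-1` in the original and only `read` as `+1` in the edited response, which is the kind of localized signal a sequence-level preference label cannot provide.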

Paper Structure (15 sections, 4 equations, 5 figures, 3 tables)


Introduction
Method
Fine-grained data collection via minimal editing
Token-level reward modeling
Experiments
Experimental setup
Experiment results
Results in human value alignment
Reward model analysis
Training efficiency
Limitations
Conclusion
Appendix
Prompt for Minimal Editing
More examples of minimal editing

Figures (5)

Figure 1: The comparison between sequence-level reward modeling (Left) and our method of fine-grained reward modeling via minimal editing (Right). Our approach diverges from sequence-level reward modeling in two key aspects: (1) Data Collection, where we ask a human or LLM to edit the model response; and (2) Reward Modeling, which enables our model to assign rewards to individual tokens, as opposed to assessing the entire sequence collectively.
Figure 2: Prompt for Claude
Figure 3: Example of fine-grained annotation via minimal editing: editing words that may cause safety issues.
Figure 4: Example of fine-grained annotation via minimal editing: providing more explanation of academic terms.
Figure 5: Example of fine-grained annotation via minimal editing: changing the literary device to better follow the instruction.
