ALaRM: Align Language Models via Hierarchical Rewards Modeling

Yuhang Lai, Siyuan Wang, Shujun Liu, Xuanjing Huang, Zhongyu Wei

TL;DR

ALaRM is a hierarchical RLHF framework that combines a holistic reward with aspect-specific rewards to better align LLMs with human preferences. It proactively selects the aspect rewards that are consistent with the holistic reward and integrates them with it in a hierarchical, threshold-guided manner, which stabilizes training and mitigates reward sparsity. Empirical results on long-form question answering (QA) and machine translation (MT) show improved holistic quality and task-specific metrics, and ablations confirm the value of both reward selection and hierarchical structuring. The method offers a scalable path for human-alignment efforts, although practical deployment requires task-specific reward design and computational resources.

Abstract

We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback (RLHF), which is designed to enhance the alignment of large language models (LLMs) with human preferences. The framework addresses the limitations of current alignment approaches, which often struggle with the inconsistency and sparsity of human supervision signals, by integrating holistic rewards with aspect-specific rewards. This integration enables more precise and consistent guidance of language models towards desired outcomes, particularly in complex and open text generation tasks. By employing a methodology that filters and combines multiple rewards based on their consistency, the framework provides a reliable mechanism for improving model alignment. We validate our approach through applications in long-form question answering and machine translation tasks, employing gpt-3.5-turbo for pairwise comparisons, and demonstrate improvements over existing baselines. Our work underscores the effectiveness of hierarchical rewards modeling in refining LLM training processes for better human preference alignment. We release our code at https://ALaRM-fdu.github.io.
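The filter-and-combine idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names, the pairwise-disagreement definition of inconsistency, the additive combination rule, the `threshold` and `weight` parameters, and the aspect names `factuality` and `length` are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def inconsistency(holistic: Sequence[float],
                  aspect: Sequence[float],
                  pairs: Sequence[Tuple[int, int]]) -> float:
    """Fraction of response pairs that the aspect reward orders differently
    from the holistic reward. Pairs tied under either reward are skipped.
    (Assumed definition, for illustration only.)"""
    disagree = 0
    counted = 0
    for i, j in pairs:
        h_diff = holistic[i] - holistic[j]
        a_diff = aspect[i] - aspect[j]
        if h_diff == 0 or a_diff == 0:
            continue
        counted += 1
        if (h_diff > 0) != (a_diff > 0):
            disagree += 1
    return disagree / counted if counted else 0.0


def select_aspects(holistic: Sequence[float],
                   aspects: Dict[str, Sequence[float]],
                   pairs: Sequence[Tuple[int, int]],
                   k: int = 1) -> List[str]:
    """Keep the k aspect rewards that disagree least with the holistic reward."""
    ranked = sorted(
        aspects,
        key=lambda name: (inconsistency(holistic, aspects[name], pairs), name),
    )
    return ranked[:k]


def hierarchical_reward(holistic_r: float,
                        aspect_r: float,
                        threshold: float = 0.0,
                        weight: float = 1.0) -> float:
    """Threshold-guided combination (assumed form): the aspect reward only
    contributes once the holistic reward exceeds the threshold."""
    if holistic_r <= threshold:
        return holistic_r
    return holistic_r + weight * aspect_r


# Toy scores for three sampled responses (made-up numbers).
holistic = [0.9, 0.2, 0.5]
aspects = {
    "factuality": [0.8, 0.1, 0.4],  # hypothetical aspect, agrees with holistic
    "length": [0.1, 0.9, 0.5],      # hypothetical aspect, disagrees with holistic
}
pairs = [(0, 1), (0, 2), (1, 2)]

chosen = select_aspects(holistic, aspects, pairs, k=1)
combined = [
    hierarchical_reward(h, a, threshold=0.5)
    for h, a in zip(holistic, aspects[chosen[0]])
]
```

In this sketch, only responses whose holistic reward clears the threshold receive the extra aspect-specific signal, so the aspect reward refines the ranking among responses that are already good rather than competing with the holistic reward everywhere.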

Paper Structure

This paper contains 44 sections, 2 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Illustration of our key ideas. The pretrained policies are first supervised fine-tuned on human-written demonstrations and then trained through RLHF with a holistic reward learned from human comparisons. The shadowed "superior area" aligns better with human preference but is hard to reach with a noisy holistic reward alone. We propose to use multiple rewards hierarchically to obtain more accurate and consistent supervision signals and thus guide the policies into the superior area.
  • Figure 2: Illustration of our framework. The reward modeling is decomposed into two parts: 1) directly assigning the holistic reward to improve general quality, and 2) combining the holistic reward with proactively selected aspect-specific rewards into a single reward, which is intended to be more accurate and consistent.
  • Figure 3: Inconsistency with the holistic reward for listed aspect-specific rewards and win rates of the greedy decoding against the pure sampling in long-form QA.
  • Figure 4: Selection results of inconsistency and win rates in MT. ${}^{*}$: The lower grammar error rate wins.
  • Figure 5: The inference prompt for UltraRM.
  • ...and 1 more figure