Alleviating Attention Hacking in Discriminative Reward Modeling through Interaction Distillation

Jianxiang Zang

TL;DR

The paper addresses the vulnerability of discriminative reward models to attention misallocation in RLHF by diagnosing two limitations: forward-decaying intra-sequence attention and lack of inter-sequence attention. It introduces Interaction Distillation, which uses an interaction-based NLU teacher to provide comprehensive attention maps and trains the RM to simulate and align with these patterns through an attentional distillation objective, without increasing inference costs. Empirical results on RLHF and out-of-distribution tasks show Id-Rm delivers more stable and generalizable reward signals than state-of-the-art RM optimization methods, improving policy quality and robustness. Overall, the work highlights attention-level interactions as a fundamental factor in RM performance and demonstrates a practical approach to align RM signals with richer token-level interactions.

Abstract

The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals for generated responses. However, mainstream discriminative reward modeling is inadequate in terms of token-level interaction, making its judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm results in the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate discriminative reward modeling via attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides reward modeling to simulate the teacher model's interaction pattern through an attentional alignment objective. Through extensive experiments, interaction distillation has demonstrated its ability to provide more stable and generalizable reward signals compared to state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation in discriminative RM.
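The idea of pairing a preference loss with an attentional alignment objective can be sketched in a few lines. This is a minimal illustration and not the paper's formulation: it assumes the common Bradley-Terry pairwise loss for the discriminative RM, uses mean squared error as a stand-in for the alignment term, and the function names and the `weight` parameter are hypothetical.

```python
import math


def pairwise_preference_loss(r_chosen, r_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: this is the usual pairwise RM loss, not taken from the paper.
    """
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def attention_alignment_loss(student_attn, teacher_attn):
    """Mean squared error between two attention maps of the same shape.

    Assumption: MSE is a placeholder for the paper's alignment objective.
    """
    total, count = 0.0, 0
    for s_row, t_row in zip(student_attn, teacher_attn):
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    return total / count


def combined_loss(r_chosen, r_rejected, student_attn, teacher_attn, weight=0.5):
    """Preference loss plus a weighted alignment term (weight is hypothetical)."""
    pref = pairwise_preference_loss(r_chosen, r_rejected)
    align = attention_alignment_loss(student_attn, teacher_attn)
    return pref + weight * align


# Toy 2x2 attention maps: the student is causal, the teacher attends everywhere.
student = [[1.0, 0.0], [0.5, 0.5]]
teacher = [[0.5, 0.5], [0.5, 0.5]]
loss = combined_loss(2.0, 0.5, student, teacher)
```

Because the alignment term only affects training, a reward model trained this way would still be scored with a single forward pass, which is consistent with the claim that inference cost does not increase.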

Paper Structure

This paper contains 27 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Attentional analysis, on the test set, of the reward model trained on the HH-RLHF dataset. The line graph shows the average accuracy of the reward model on the test set when masking content at different stages; the bar chart represents the normalized average cumulative attention at each stage from the beginning to the end of the joint prompt-response sequence, where “correct” and “incorrect” denote correctly classified and misclassified sequences.
  • Figure 2: Attentional analysis, on the test set, of two reward models trained on the HH-RLHF dataset, including attention scores of tokens in the reward model encoding for correct (top) and incorrect (bottom) decisions.
  • Figure 3: The interaction distillation-based preference modeling framework for reward model (Id-Rm).
  • Figure 4: Attention transfer mechanism in interaction distillation.
  • Figure 5: Hyperparameter sensitivity study, for $\eta$ (left) and $K$ (right) respectively.
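
The per-stage cumulative attention plotted in Figure 1 can be made concrete with a toy calculation. The sketch below is not the paper's analysis code: it assumes uniform attention scores under a causal mask and hypothetical function names, and it shows one plausible reading of "forward-decaying" attention, namely that earlier positions receive more total attention than later ones simply because more queries can see them.

```python
def causal_uniform_attention(n):
    """Row i attends uniformly to positions 0..i (causal mask, equal scores)."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def attention_received(attn):
    """Column sums: total attention each key position receives from all queries."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) for j in range(n)]


def normalize(values):
    """Scale values so they sum to 1, mirroring a normalized bar chart."""
    total = sum(values)
    return [v / total for v in values]


received = attention_received(causal_uniform_attention(6))
shares = normalize(received)  # share of attention per position, first to last
```

In this toy setting the shares decrease monotonically from the first position to the last, even though no position is intrinsically more important than another.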