Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning

Jared Joselowitz, Ritam Majumdar, Arjun Jagota, Matthieu Bou, Nyal Patel, Satyapriya Krishna, Sonali Parbhoo

TL;DR

This paper tackles the opacity of reward functions learned through RLHF by applying inverse reinforcement learning (IRL), specifically Maximum Margin IRL, to recover an explicit reward model $\hat{R}$ from RLHF-trained LLMs. The authors implement a four-step IRL pipeline (data curation, ground-truth reward modeling, RLHF fine-tuning, and IRL reward recovery) and validate it on toxicity benchmarks using Pythia 70M and 410M models. They demonstrate that IRL can recover rewards that align with human judgments, reveal the non-identifiability of reward functions, and show how reward quality affects downstream RLHF performance: good IRL rewards can enable comparable or improved toxicity reduction, while poor rewards can degrade safety. Collectively, the results establish IRL as a diagnostic and auditing tool for LLM alignment, with practical implications for safer deployment and targeted interventions against reward hacking and misalignment.

Abstract

Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 85% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.
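To make the reward-recovery idea concrete, the following is a minimal sketch of a max-margin objective, not the authors' implementation. It assumes a linear reward over hand-made feature vectors and a hinge loss on pairs of (preferred, non-preferred) outputs; the function names, the toy features, and the hyperparameters are all hypothetical and chosen only for illustration.

```python
def reward(w, phi):
    """Linear reward: dot product of weights w and feature vector phi."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def max_margin_step(w, phi_pref, phi_other, margin=1.0, lr=0.1):
    """One subgradient step on the hinge loss
    max(0, margin - (reward(phi_pref) - reward(phi_other)))."""
    gap = reward(w, phi_pref) - reward(w, phi_other)
    if gap >= margin:
        return w, 0.0  # margin already satisfied: no update
    loss = margin - gap
    # Subgradient of the hinge loss w.r.t. w is -(phi_pref - phi_other).
    new_w = [wi + lr * (p - o) for wi, p, o in zip(w, phi_pref, phi_other)]
    return new_w, loss


def fit_reward(pairs, dim, epochs=50, margin=1.0, lr=0.1):
    """Fit reward weights so preferred outputs outscore the others by a margin."""
    w = [0.0] * dim
    for _ in range(epochs):
        for phi_pref, phi_other in pairs:
            w, _ = max_margin_step(w, phi_pref, phi_other, margin, lr)
    return w


# Toy data (assumption): feature 0 is a made-up toxicity score, feature 1 is noise.
# Each pair is (features of the aligned model's output, features of a toxic output).
pairs = [
    ([0.1, 0.5], [0.9, 0.4]),
    ([0.2, 0.1], [0.8, 0.7]),
    ([0.0, 0.9], [0.7, 0.2]),
]
w_hat = fit_reward(pairs, dim=2)
```

In this toy setting the learned weight on the toxicity feature comes out negative, so less toxic outputs receive higher reward. In the paper's setting the reward model is a neural network over text rather than a two-dimensional linear function, so this sketch only illustrates the shape of the objective.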

Paper Structure

This paper contains 36 sections, 2 equations, 18 figures, 6 tables, 2 algorithms.

Figures (18)

  • Figure 1: Evaluation of the IRL-extracted reward function on toxic and non-toxic adjective completions. Left: Scatter plot comparing ground truth rewards (x-axis) to IRL-extracted rewards (y-axis), revealing strong alignment and effective ranking. Middle: Violin plot showing clear separation between toxic and non-toxic samples in terms of extracted reward. Right: Confusion matrix indicating strong classification performance (Precision: 0.75, Recall: 0.90), with high accuracy in identifying toxic outputs and some false negatives among non-toxic samples.
  • Figure 2: Left (a, b, c): Increasing toxic examples in training improves precision, Kendall Tau, and recall, enhancing the model’s ability to rank non-toxic outputs. Right (d, e, f): Adding non-toxic data while keeping toxic samples fixed degrades classification and ranking quality, as precision and Kendall Tau decline, though recall remains high with slight variability.
  • Figure 3: IRL classification degrades slowly with Gaussian noise, highlighting its resilience.
  • Figure 4: (a) Accuracy and correlation over 60 epochs—solid lines show ground-truth accuracy, dashed lines show correlation with labels. Both 70M and 410M models surpass ground-truth in accuracy and correlation at convergence. (b) IRL-extracted models for toxic text classification: the 70M model achieves 84.15% accuracy, 82.36% F1, while the 410M model reaches 88.52% accuracy, 86.19% F1, slightly outperforming ground-truth. (c) The 70M IRL-RLHF model has lower losses, indicating better optimization. (d) The 410M model better captures reward function nuances. (e-h) Both models achieve higher returns and normalized mean rewards.
  • Figure 5: Comparison of reward-distributions. Good reward models successfully separate (Left) while poor models fail to separate (Right) toxic and non-toxic sentences.
  • ...and 13 more figures
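The figure captions above report precision, recall, and Kendall Tau for the extracted reward. As a rough illustration of what these quantities measure, here is a small sketch in plain Python. It assumes binary labels with 1 marking the positive class and uses the tau-a variant of Kendall's coefficient (ties count as neither concordant nor discordant); which variant the paper uses is not stated here, and all data values below are invented.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Invented example values: ground-truth rewards vs. IRL-extracted rewards.
ground_truth = [0.9, 0.7, 0.4, 0.1]
extracted = [2.0, 1.1, 1.5, -0.3]
tau = kendall_tau(ground_truth, extracted)

# Invented example labels: 1 = positive class, 0 = negative class.
labels = [1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1]
precision, recall = precision_recall(labels, predicted)
```

Kendall Tau depends only on the ordering of the two reward lists, so it is unaffected by rescaling or shifting the extracted reward. That makes it a natural fit for comparing reward functions that are only identifiable up to such transformations.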