Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning

Jared Joselowitz; Ritam Majumdar; Arjun Jagota; Matthieu Bou; Nyal Patel; Satyapriya Krishna; Sonali Parbhoo

TL;DR

This paper tackles the opacity of reward functions learned through RLHF by applying inverse reinforcement learning (IRL), specifically Maximum Margin IRL, to recover an explicit reward model $\hat{R}$ from RLHF-trained LLMs. The authors implement a four-step pipeline (data curation, ground-truth reward modeling, RLHF fine-tuning, and IRL reward recovery) and validate it on toxicity benchmarks using Pythia 70M and 410M models. They show that IRL can recover rewards that align with human judgments, that reward functions are non-identifiable, and that reward quality affects downstream RLHF performance: good IRL rewards can yield comparable or improved toxicity reduction, while poor rewards can degrade safety. Together, the results position IRL as a diagnostic and auditing tool for LLM alignment, with practical implications for safer deployment and for targeted interventions against reward hacking and misalignment.
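
To make the max-margin idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a linear reward over hypothetical two-dimensional feature vectors (the paper works with learned reward models over LLM outputs), and the names `max_margin_reward` and `preference_accuracy`, the synthetic data, and all hyperparameters are illustrative assumptions. The goal is to find weights `w` such that each "aligned" sample scores at least a fixed margin above its paired "baseline" sample.

```python
import numpy as np


def max_margin_reward(expert_feats, baseline_feats, margin=1.0, lr=0.1, l2=0.01, epochs=200):
    """Fit linear reward weights w so paired expert samples outscore baseline samples by `margin`.

    Minimizes the mean hinge loss max(0, margin - w.(phi_expert - phi_baseline))
    plus an L2 penalty, using plain subgradient descent.
    """
    expert_feats = np.asarray(expert_feats, dtype=float)
    baseline_feats = np.asarray(baseline_feats, dtype=float)
    diff = expert_feats - baseline_feats          # shape (n_pairs, n_features)
    w = np.zeros(diff.shape[1])
    for _ in range(epochs):
        violated = diff @ w < margin              # pairs whose margin is not yet met
        grad = -diff[violated].sum(axis=0) / len(diff) + l2 * w
        w -= lr * grad
    return w


def preference_accuracy(w, expert_feats, baseline_feats):
    """Fraction of pairs in which the recovered reward ranks the expert sample higher."""
    return float(np.mean(expert_feats @ w > baseline_feats @ w))


# Synthetic stand-in data (an assumption for illustration): feature 0 separates the
# "aligned" samples from the "baseline" samples, feature 1 is noise.
rng = np.random.default_rng(0)
expert = rng.normal(loc=[1.0, 0.0], scale=0.3, size=(50, 2))
baseline = rng.normal(loc=[-1.0, 0.0], scale=0.3, size=(50, 2))

w = max_margin_reward(expert, baseline)
acc = preference_accuracy(w, expert, baseline)
print("recovered weights:", w)
print("pairwise accuracy:", acc)
```

One simple face of non-identifiability is visible even in this toy setting: any positive rescaling of `w` ranks every pair identically, so `preference_accuracy(3.0 * w, expert, baseline)` equals `acc`, and the preference data alone cannot distinguish the two rewards.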

Abstract

Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 85% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.
