Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment

Elias Malomgré; Pieter Simoens

TL;DR

This paper introduces the Alignment Flywheel, a human-in-the-loop lifecycle that iteratively hardens a reward model through automated audits and refinement, yielding a reward model that is inspectable, editable, and model-agnostic.

Abstract

AI alignment is growing in importance, yet current approaches suffer from a critical structural flaw that entangles the safety objectives with the agent's policy. Methods such as Reinforcement Learning from Human Feedback and Direct Preference Optimization create opaque, single-use alignment artifacts, which we term Alignment Waste. We propose Interactionless Inverse Reinforcement Learning to decouple alignment artifact learning from policy optimization, producing an inspectable, editable, and model-agnostic reward model. Additionally, we introduce the Alignment Flywheel, a human-in-the-loop lifecycle that iteratively hardens the reward model through automated audits and refinement. This architecture transforms safety from a disposable expense into a durable, verifiable engineering asset.
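The audit-and-refine loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' method: the kernel-based `ToyRewardModel`, the `violates_constraint` rule (states with x > 1.0 are treated as forbidden), the 0.5 threshold, and the auto-approving triage step are all hypothetical stand-ins. The sketch only shows the shape of the loop: a reward model fit to sparse expert samples leaks reward into a forbidden region, an automated audit flags those states, and an explicit edit to the model removes them.

```python
import math


def rbf(a, b, width):
    """Gaussian kernel between two points (hypothetical similarity measure)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * width ** 2))


class ToyRewardModel:
    """Illustrative editable reward model: expert bumps minus penalty bumps."""

    def __init__(self, expert_states):
        self.expert_states = list(expert_states)
        self.penalties = []  # inspectable list of corrections added by refinement

    def reward(self, state):
        bonus = max((rbf(state, e, 0.5) for e in self.expert_states), default=0.0)
        penalty = max((rbf(state, p, 0.15) for p in self.penalties), default=0.0)
        return bonus - penalty


def violates_constraint(state):
    # Assumed constraint for this toy world: x > 1.0 is off-limits.
    return state[0] > 1.0


def audit(model, candidates, threshold=0.5):
    """Flag forbidden states that the model still scores above the threshold."""
    return [s for s in candidates
            if violates_constraint(s) and model.reward(s) > threshold]


def refine(model, flagged):
    """Stand-in for human triage plus refinement: accept every flagged state."""
    model.penalties.extend(flagged)


def run_flywheel(model, candidates, max_rounds=5):
    """Repeat audit and refinement until an audit comes back clean."""
    history = []
    for _ in range(max_rounds):
        flagged = audit(model, candidates)
        history.append(len(flagged))
        if not flagged:
            break
        refine(model, flagged)
    return history


expert_states = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.2)]
grid = [(x / 4.0, y / 4.0) for x in range(-4, 9) for y in range(-4, 9)]
model = ToyRewardModel(expert_states)
history = run_flywheel(model, grid)
print(history)
```

In this sketch the corrections live in `model.penalties`, a plain list that can be read or edited without retraining any policy, which is the property the abstract attributes to a decoupled reward model. The audit only covers the sampled grid, so states between grid points may still carry spurious reward; a real audit would need a far more thorough search.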

Paper Structure (12 sections, 2 equations, 1 figure)

Introduction
The IIRL Paradigm
The IIRL Objective
Analysis of Architectures
Modular and Editable Architectures
The Refinement Toolkit
The Alignment Flywheel
Phase 0: Seeding and Defining Constraints
Phase 1: Automated Auditing
Phase 2 & 3: Triage and Refinement
Application
Implications and Vision

Figures (1)

Figure 1: Alignment Flywheel in a 3D toy world. A representation-based IIRL model trained on sparse expert samples generates a reward landscape with $g_\psi$; yellow=low, purple=high. A spurious extrapolation (red circle) is detected in Phase 1 and corrected via refinement in Phases 2 & 3.
