CHIRPs: Change-Induced Regret Proxy metrics for Lifelong Reinforcement Learning

John Birkbeck; Adam Sobey; Federico Cerutti; Katherine Heseltine Hurley Flynn; Timothy J. Norman

TL;DR

This work addresses the problem of predicting how environmental changes affect reinforcement learning performance by introducing Change-Induced Regret Proxies (CHIRPs). It formalizes Scaled Optimal Policy Regret (SOPR) and proposes the $W_1$-MDP distance as a practical CHIRP that acts as a proxy for SOPR and can be computed from transition samples. Experiments in SimpleGrid and MetaWorld show a positive, monotonic relationship between CHIRP values and SOPR, and show that CHIRP-driven policy reuse (CPR) can outperform existing lifelong RL methods, including in interleaved-task settings. Calibration via spline fitting further enables cross-environment comparisons, giving a scalable framework for predicting and mitigating the impact of change in lifelong reinforcement learning.
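As a rough illustration of how a $W_1$-MDP distance might be estimated from transition samples, the Python sketch below assumes deterministic transitions and rewards. Under that assumption, the Wasserstein-1 distance between the two outcome distributions at a given state-action pair reduces to the distance between the two outcomes. The names `sampled_w1_deterministic`, `make_line_world`, and `sample_state_action`, the toy line-world MDPs, and the use of Euclidean distance over concatenated (next state, reward) vectors are all assumptions made for illustration, not the paper's implementation.

```python
import math
import random


def sampled_w1_deterministic(step_i, step_j, sample_state_action,
                             n_samples=1000, seed=0):
    """Average distance between the (next_state, reward) outcomes of two
    deterministic MDPs over sampled state-action pairs (hypothetical helper)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        state, action = sample_state_action(rng)
        next_i, reward_i = step_i(state, action)
        next_j, reward_j = step_j(state, action)
        # For deterministic outcomes, W1 between two point masses is simply
        # the distance between the two points.
        total += math.dist((*next_i, reward_i), (*next_j, reward_j))
    return total / n_samples


def make_line_world(goal, size=5):
    """Toy 1-D MDP (an assumption for this example): move left or right on
    a line of `size` cells, with reward 1.0 for landing on `goal`."""
    def step(state, action):
        nxt = min(max(state[0] + action, 0), size - 1)
        return (nxt,), (1.0 if nxt == goal else 0.0)
    return step


def sample_state_action(rng):
    """Uniformly sample a cell and a move of -1 or +1."""
    return (rng.randrange(5),), rng.choice((-1, 1))


mdp_a = make_line_world(goal=4)
mdp_b = make_line_world(goal=0)

distance_same = sampled_w1_deterministic(mdp_a, mdp_a, sample_state_action)
distance_changed = sampled_w1_deterministic(mdp_a, mdp_b, sample_state_action)
print(distance_same, distance_changed)
```

In this toy setting the two MDPs share dynamics and differ only in goal position, so the estimate is driven entirely by reward differences. Stochastic MDPs would need a full optimal-transport computation per state-action pair rather than a single point-to-point distance.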

Abstract

Reinforcement learning (RL) agents are costly to train and fragile to environmental changes. They often perform poorly when there are many changing tasks, prohibiting their widespread deployment in the real world. Many Lifelong RL agent designs have been proposed to mitigate issues such as catastrophic forgetting or demonstrate positive characteristics like forward transfer when change occurs. However, no prior work has established whether the impact on agent performance can be predicted from the change itself. Understanding this relationship will help agents proactively mitigate a change's impact for improved learning performance. We propose Change-Induced Regret Proxy (CHIRP) metrics to link change to agent performance drops and use two environments to demonstrate a CHIRP's utility in lifelong learning. A simple CHIRP-based agent achieved $48\%$ higher performance than the next best method in one benchmark and attained the best success rates in 8 of 10 tasks in a second benchmark which proved difficult for existing lifelong RL agents.

Paper Structure (16 sections, 6 equations, 9 figures, 3 tables)

The Value of Measuring Change
Related Work
Preliminaries
Scaled Optimal Policy Regret
When is SOPR calculable?
Change-Induced Regret Proxy (CHIRP) Metrics
Validating $W_1$-MDP as a CHIRP
Validation in SimpleGrid
Approximating $W_1(\mathcal{M}_i, \mathcal{M}_j)$ with Sampling
Verification in MetaWorld
Lifelong Learning with a CHIRP
CHIRP Policy Reuse
Block learning of tasks
Interleaved tasks
Calibrating a CHIRP across environments
...and 1 more sections

Figures (9)

Figure 1: A SimpleGrid environment. The blue, circular agent must reach the red square goal. The grey edge squares are impassable walls.
Figure 2: SOPR against our CHIRP for 10,500 SimpleGrid MDP variants. The medians of each data bin are marked. Data was binned into 16 approximately equal volumes ($n \approx 650$). The overlaid calibration curve is used further below.
Figure 3: The ten tasks selected for our experimentation in MetaWorld. Left to right, top to bottom: get coffee, turn dial, unlock door, press handle side, press handle, slide plate back, slide plate, push, reach, reach with wall.
Figure 4: The CHIRP-SOPR relationship in MetaWorld for ten tasks. Each violin plot's median is marked. Data was binned to approximately equal volumes ($n\approx$ 5100). An example B-spline calibration curve is overlaid.
Figure 5: The median $W_1$ distances between 10 MetaWorld MDPs, shaded by distance value. The MDP index corresponds with Figure 3's ordering; MDP 0 is 'coffee button', and MDP 9 is 'reach with wall'.
...and 4 more figures
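The captions for Figures 2 and 4 describe binning the CHIRP-SOPR data into approximately equal volumes, marking each bin's median, and overlaying a calibration curve. A minimal Python sketch of that pipeline follows. It swaps in piecewise-linear interpolation between bin medians as a simple stand-in for the B-spline fit mentioned in the captions, and it runs on synthetic data. The function names `binned_medians` and `calibrate` and the linear toy relationship are assumptions for illustration only.

```python
import statistics
from bisect import bisect_right


def binned_medians(chirp_values, sopr_values, n_bins=16):
    """Sort pairs by CHIRP value, split them into n_bins near-equal groups,
    and return (median CHIRP, median SOPR) for each non-empty group."""
    pairs = sorted(zip(chirp_values, sopr_values))
    n = len(pairs)
    medians = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            medians.append((statistics.median(c for c, _ in chunk),
                            statistics.median(s for _, s in chunk)))
    return medians


def calibrate(medians):
    """Return a function mapping a CHIRP value to an interpolated SOPR
    estimate (piecewise-linear stand-in for a spline fit)."""
    xs = [c for c, _ in medians]
    ys = [s for _, s in medians]

    def predict(chirp):
        # Clamp to the outermost bin medians outside the fitted range.
        if chirp <= xs[0]:
            return ys[0]
        if chirp >= xs[-1]:
            return ys[-1]
        k = bisect_right(xs, chirp)  # xs[k-1] <= chirp < xs[k]
        x0, x1, y0, y1 = xs[k - 1], xs[k], ys[k - 1], ys[k]
        return y0 + (y1 - y0) * (chirp - x0) / (x1 - x0)

    return predict


# Synthetic data (not from the paper): SOPR grows linearly with the CHIRP.
chirps = [i / 100 for i in range(160)]
soprs = [c / 2 for c in chirps]

medians = binned_medians(chirps, soprs, n_bins=16)
predict_sopr = calibrate(medians)
print(len(medians), predict_sopr(0.5))
```

Using medians rather than means keeps each bin's summary robust to outliers. A smoother curve would require a spline library, which this sketch deliberately avoids.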
