Robust Intervention Learning from Emergency Stop Interventions

Ethan Pronovost; Khimya Khetarpal; Siddhartha Srinivasa

TL;DR

This work addresses learning from imperfect deployment-time interventions, specifically emergency-stop signals, by formulating Robust Intervention Learning (RIL) and proposing Residual Intervention Fine-Tuning (RIFT). By treating intervention feedback as an incomplete signal and regularizing updates toward a prior policy, RIFT enables robust policy improvement across various intervention strategies and prior qualities. The authors provide theoretical results showing principled improvement under certain intervention schemes and demonstrate empirical gains in multiple control tasks, especially when interventions are noisy or less informative. The approach offers a practical pathway to safer, more reliable deployment of autonomous systems by harmonizing supervisor signals with priors through residual fine-tuning.

Abstract

Human interventions are a common source of data in autonomous systems during testing. These interventions provide an important signal about where the current policy needs improvement, but are often noisy and incomplete. We define Robust Intervention Learning (RIL) as the problem of learning from intervention data while remaining robust to the quality and informativeness of the intervention signal. In the best case, interventions are precise and avoiding them is sufficient to solve the task, but in many realistic settings avoiding interventions is necessary but not sufficient for achieving good performance. We study robust intervention learning in the context of emergency stop interventions and propose Residual Intervention Fine-Tuning (RIFT), a residual fine-tuning algorithm that treats intervention feedback as an incomplete learning signal and explicitly combines it with a prior policy. By framing intervention learning as a fine-tuning problem, our approach leverages structure encoded in the prior policy to resolve ambiguity when intervention signals under-specify the task. We provide theoretical analysis characterizing conditions under which this formulation yields principled policy improvement, and identify regimes where intervention learning is expected to fail. Our experiments reveal that residual fine-tuning enables robust and consistent policy improvement across a range of intervention strategies and prior policy qualities, and highlight robust intervention learning as a promising direction for future work.
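The role of the prior policy in resolving ambiguity can be illustrated with a minimal single-state sketch. This is not the authors' RIFT algorithm, which the abstract describes as residual fine-tuning on sequential control tasks. It only uses the standard closed-form solution of a KL-regularized objective, in which the policy maximizing expected reward minus `omega` times the KL divergence to the prior is proportional to the prior times `exp(reward / omega)`. The three-action setup, the reward values, the prior probabilities, and the function name `kl_regularized_policy` are all hypothetical choices made for illustration.

```python
import numpy as np


def kl_regularized_policy(reward, prior, omega):
    """Single-state maximizer of E_pi[reward] - omega * KL(pi || prior).

    The closed form is pi(a) proportional to prior(a) * exp(reward(a) / omega).
    """
    logits = np.log(prior) + reward / omega
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Hypothetical toy problem: action 0 triggers an e-stop, actions 1 and 2 do not.
# The intervention signal alone cannot distinguish action 1 from action 2.
intervention_reward = np.array([-1.0, 0.0, 0.0])

# Assumed prior policy: often picks the bad action 0, but prefers 1 over 2.
prior = np.array([0.5, 0.4, 0.1])
uniform = np.full(3, 1.0 / 3.0)  # stands in for "no prior information"

omega = 0.2  # assumed regularization strength
without_prior = kl_regularized_policy(intervention_reward, uniform, omega)
with_prior = kl_regularized_policy(intervention_reward, prior, omega)

print("without prior:", np.round(without_prior, 3))
print("with prior:   ", np.round(with_prior, 3))
```

With a uniform reference, actions 1 and 2 receive equal probability because the intervention signal under-specifies the task. With the prior as the reference, the policy still moves away from the intervened action but keeps the prior's preference between the two remaining actions.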

Paper Structure (35 sections, 7 theorems, 49 equations, 13 figures, 2 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Problem Statement
Maximum Entropy Objective
State-Action Visitation
Residual Q-Learning as a Fine-tuning Primitive
An Algorithm for RIL
Emergency Stop Interventions
RL Fine-Tuning for Interventions
RIFT: Residual Intervention Fine-Tuning
Provable Policy Improvement
Advantage Difference
Visitation Difference
State-Based Strategies
...and 20 more sections

Key Result

Theorem 4.1

Let $r : \mathcal{S} \times \mathcal{A} \to \mathbb R$ be any reward function and $\pi_0 : \mathcal{S} \to \textrm{Int} ( \Delta_{\mathcal{A}} )$ be any policy. Set the entropy coefficient $\alpha = \omega$. Then

Figures (13)

Figure 1: Robust Intervention Learning (RIL) recognizes that interventions are an imperfect signal about how to solve a task. While directly optimizing for avoiding interventions (grey) works well with highly informative interventions, the goal of RIL (blue) is to improve the prior policy (red) under many intervention strategies.
Figure 2: Depiction of hypothetical trajectories with e-stop interventions. In case 1, the intervention occurs to avoid an imminent catastrophic outcome (hitting the wall). In case 2, the intervention occurs because the robot is taking a wrong path, even though there is no immediate danger. In case 3, the intervention occurs because the robot gets stuck, even though the state it's in is close to those of successful expert demonstrations. A human expert might only intervene for one or two of these cases, yet all three are suboptimal.
Figure 3: Experimental results on the Lunar Lander environment using a prior policy with approximately 50% success rate and two different intervention strategies. Without prior policy regularization (RLIF), the policy forgets the information in the prior policy and needs highly informative interventions to succeed. With prior policy regularization (RIFT), the policy can combine the information in the prior policy with the information from the interventions to achieve a significantly higher success rate.
Figure 4: Experimental results on the Lunar Lander environment using different prior policies and intervention strategies.
Figure 5: Ablation experiments for the regularization coefficient $\omega$ used by RIFT. The optimal value of $\omega$ increases as the interventions become less informative. Near-optimal performance can be achieved with $\omega$ values spanning a window of one to two orders of magnitude, suggesting that this parameter does not need to be extensively tuned.
...and 8 more figures

Theorems & Definitions (12)

Theorem 4.1
Theorem 5.1
proof
Theorem 1.1
Lemma 1.2
proof
Lemma 1.3
proof
Lemma 1.4
proof
...and 2 more
