Safety through feedback in Constrained RL

Shashank Reddy Chirra; Pradeep Varakantham; Praveen Paruchuri

TL;DR

This work tackles safety in constrained reinforcement learning when the cost function is unknown and must be learned from offline safety feedback. It introduces Reinforcement Learning from Safety Feedback (RLSF), which converts horizon-spanning trajectory feedback into a tractable state-level binary cost via a surrogate likelihood objective and a novel surrogate loss, while reducing annotation burden with novelty-based trajectory sampling. The approach demonstrates effective cost inference, safe policy learning, and transfer of inferred costs across embodiments in diverse Safety Gymnasium and driving scenarios, achieving strong performance relative to baselines and approaching the best known-cost methods. It also analyzes biases from trajectory-level feedback, proposes a bias-correction heuristic, and shows the value of transferring inferred costs to new agents, highlighting practical pathways for safer autonomous systems when explicit cost design is challenging.
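
The conversion from trajectory-level feedback to a state-level binary cost can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each segment's binary feedback (1 = unsafe) is copied to every state in that segment, which yields noisy state labels, and that a plain logistic-regression classifier stands in for the cost model. The function names, the toy 1-D states, and the hyperparameters are all hypothetical.

```python
import numpy as np

def state_level_labels(segments, segment_labels):
    """Copy each segment's binary feedback (1 = unsafe) onto all of its states."""
    states = np.concatenate([np.asarray(s, dtype=float) for s in segments], axis=0)
    labels = np.concatenate(
        [np.full(len(s), float(y)) for s, y in zip(segments, segment_labels)]
    )
    return states, labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_cost_classifier(states, labels, lr=0.5, epochs=500):
    """Fit logistic regression by gradient descent on the mean binary cross-entropy."""
    w = np.zeros(states.shape[1])
    b = 0.0
    for _ in range(epochs):
        err = sigmoid(states @ w + b) - labels  # gradient of BCE w.r.t. the logits
        w -= lr * (states.T @ err) / len(labels)
        b -= lr * err.mean()
    return w, b

def predict_cost(states, w, b, threshold=0.5):
    """Binary state-level cost: 1 if the predicted unsafe probability exceeds threshold."""
    probs = sigmoid(np.asarray(states, dtype=float) @ w + b)
    return (probs > threshold).astype(int)

# Toy data (assumed): two segments judged safe, one judged unsafe.
# The state 0.0 in the unsafe segment receives a noisy "unsafe" label.
segments = [
    [[-1.0], [-0.5], [0.0]],
    [[-1.0], [-0.5], [0.0]],
    [[0.0], [2.0], [3.0]],
]
segment_labels = [0, 0, 1]

states, labels = state_level_labels(segments, segment_labels)
w, b = train_cost_classifier(states, labels)
costs = predict_cost([[-1.0], [3.0]], w, b)
```

In this toy setup the state 0.0 appears under both labels, so the classifier has to absorb label noise rather than memorize the segment labels.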

Abstract

In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g., aggressive lane changes) is inherently complex. In such scenarios, the cost function can be learned from feedback collected offline in between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback on every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? and (2) How many trajectories are necessary for effective learning? To address these questions, we introduce novelty-based sampling, which selectively involves the evaluator only when the agent encounters a novel trajectory. We showcase the efficiency of our method through experimentation on several benchmark Safety Gymnasium environments and realistic self-driving scenarios.
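
The idea of querying the evaluator only for novel trajectories can be sketched as follows. The abstract does not specify how novelty is measured, so this sketch makes its own assumptions: states are discretized onto a grid, and a trajectory counts as novel when the fraction of its states falling in previously unseen cells exceeds a threshold. The class name `NoveltySampler` and the parameters `cell_size` and `threshold` are hypothetical.

```python
import numpy as np

class NoveltySampler:
    """Decide whether a trajectory should be shown to the evaluator (illustrative only)."""

    def __init__(self, cell_size=0.5, threshold=0.2):
        self.cell_size = cell_size  # width of each grid cell (assumed novelty measure)
        self.threshold = threshold  # minimum fraction of unseen states to trigger a query
        self.seen = set()           # grid cells that already received feedback

    def _cells(self, trajectory):
        # trajectory: array-like of shape (T, state_dim)
        grid = np.floor(np.asarray(trajectory, dtype=float) / self.cell_size).astype(int)
        return [tuple(row) for row in grid]

    def should_query(self, trajectory):
        cells = self._cells(trajectory)
        unseen = sum(cell not in self.seen for cell in cells)
        is_novel = unseen / len(cells) > self.threshold
        if is_novel:
            # Only trajectories that are actually queried mark their cells as seen.
            self.seen.update(cells)
        return is_novel

sampler = NoveltySampler(cell_size=1.0, threshold=0.2)
first = sampler.should_query([[0.1, 0.1], [0.2, 0.3], [1.5, 0.5]])   # all cells unseen
repeat = sampler.should_query([[0.1, 0.1], [0.2, 0.3], [1.5, 0.5]])  # nothing new
partly_new = sampler.should_query([[5.0, 5.0], [0.1, 0.1]])          # half the cells unseen
```

Under this scheme the number of queries tends to fall as training proceeds, because the agent revisits regions that already received feedback.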

Paper Structure (40 sections, 9 theorems, 20 equations, 14 figures, 4 tables, 1 algorithm)

Introduction
Preliminaries
Markov Decision Process
Constrained Markov Decision Process
Problem Definition
Method
Nature of the Feedback
Inferring the Cost Function
Efficient Subsampling of Trajectories
Policy Optimization
Experiments
Experiment Setup
Cost Inference across various tasks
Benchmark Environments
Driving Scenarios
...and 25 more sections

Key Result

Proposition 1

The surrogate loss $L^{sur}$ is an upper bound on the likelihood loss $L^{mle}$.
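
A short sketch of why a bound of this kind can hold, under one assumed formalization that may differ from the paper's exact definitions: $p_t \in (0,1)$ is the predicted probability that state $s_t$ is unsafe, a segment has label $y = 1$ if it is judged unsafe and $y = 0$ otherwise, a segment is safe only if all of its states are safe, and the surrogate assigns the segment label to every state in it.

```latex
Assumed per-segment losses:
\begin{align*}
L^{mle} &= -(1-y)\,\log\prod_{t}(1-p_t) \;-\; y\,\log\Bigl(1-\prod_{t}(1-p_t)\Bigr),\\
L^{sur} &= -\sum_{t}\Bigl[(1-y)\,\log(1-p_t) + y\,\log p_t\Bigr].
\end{align*}
Case $y = 0$: both losses equal $-\sum_{t}\log(1-p_t)$.

Case $y = 1$: for any index $t'$, $\prod_{t}(1-p_t) \le 1-p_{t'}$, so
$1-\prod_{t}(1-p_t) \ge p_{t'}$. Since every term $-\log p_t$ is non-negative,
\[
-\log\Bigl(1-\prod_{t}(1-p_t)\Bigr) \;\le\; -\log p_{t'} \;\le\; -\sum_{t}\log p_t .
\]
Hence $L^{mle} \le L^{sur}$ in both cases.
```

Under these assumptions the gap arises only for unsafe segments, where the surrogate penalizes every state even though a single unsafe state would explain the feedback.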

Figures (14)

Figure 1: Cost Violation rate of different algorithms in the Driver environment. Each algorithm is run for $6$ independent seeds, with the curves representing the mean and the shaded regions indicating the standard error.
Figure 2: Comparison of different sampling and scheduling schemes. Results are averaged over $3$ independent seeds. The proposed sampling method generates on average $\approx1950$ queries, hence for fair comparison the other methods were given a budget of $2000$ queries.
Figure 3: Comparing the inferred cost to the true cost.
Figure 4: Increasing $c_{max}$ by $\delta$ to correct for the overestimation bias.
Figure 5: Driver Environments
...and 9 more figures

Theorems & Definitions (14)

Proposition 1
Proposition 2
Proposition 3
Corollary 1
Proposition 1
proof
Lemma 1
proof
Proposition 2
proof
...and 4 more
