Safe Reinforcement Learning with Free-form Natural Language Constraints and Pre-Trained Language Models

Xingzhou Lou; Junge Zhang; Ziyan Wang; Kaiqi Huang; Yali Du

TL;DR

This work tackles safe reinforcement learning when constraints come as free-form natural language and ground-truth cost functions are unavailable. It proposes a cost-prediction module built from a decoder LM to condense constraints and an encoder LM to embed constraints and text observations, using a contrastive loss to align semantically similar constraints and a cosine-threshold rule to predict violations. The LM-based costs are integrated into PPO via a Lagrangian objective, enabling agents to maximize rewards while respecting a constraint budget without access to true costs. Empirical results on Hazard-World-Grid and SafetyGoal demonstrate that the method achieves strong task performance with adherence to constraints, and extensive ablations validate the necessity of both encoder and decoder LMs, as well as the contrastive objective. This approach broadens safe RL applicability by leveraging pre-trained LMs to handle diverse, free-form language constraints and reduces the need for domain-specific cost design.
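The Lagrangian integration mentioned above can be illustrated with a minimal sketch of the standard PPO-Lagrangian recipe. This is not taken from the paper's code: the function names, the learning rate, and the rule of subtracting a multiplier-weighted cost advantage are assumptions chosen for illustration. The only element drawn from the text is that the cost signal is the LM-predicted cost rather than a ground-truth cost.

```python
def update_lagrange_multiplier(lam, episode_costs, budget, lr=0.05):
    """Dual ascent step on the multiplier (hypothetical helper).

    episode_costs: predicted cumulative costs of recent episodes.
    budget: the constraint budget the policy should stay under.
    The multiplier grows when the mean cost exceeds the budget and
    shrinks otherwise; it is clipped so it never becomes negative.
    """
    mean_cost = sum(episode_costs) / len(episode_costs)
    return max(0.0, lam + lr * (mean_cost - budget))


def penalized_advantage(reward_adv, cost_adv, lam):
    """Combine reward and cost advantages for the policy update
    (assumed form: reward advantage minus lam times cost advantage)."""
    return reward_adv - lam * cost_adv


# Usage: predicted costs above the budget raise the penalty weight.
lam = update_lagrange_multiplier(0.0, [3.0, 5.0], budget=2.0, lr=0.1)
adv = penalized_advantage(reward_adv=1.0, cost_adv=0.5, lam=lam)
```

With a larger multiplier, actions that the cost prediction module flags as violations contribute a larger penalty, which pushes the policy toward the budget.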

Abstract

Safe reinforcement learning (RL) agents accomplish given tasks while adhering to specific constraints. Employing constraints expressed via easily-understandable human language offers considerable potential for real-world applications due to its accessibility and non-reliance on domain expertise. Previous safe RL methods with natural language constraints typically adopt a recurrent neural network, which leads to limited capabilities when dealing with various forms of human language input. Furthermore, these methods often require a ground-truth cost function, necessitating domain expertise for the conversion of language constraints into a well-defined cost function that determines constraint violation. To address these issues, we propose to use pre-trained language models (LMs) to facilitate RL agents' comprehension of natural language constraints and allow them to infer costs for safe policy learning. Through the use of pre-trained LMs and the elimination of the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived free-form natural language constraints. Experiments on grid-world navigation and robot control show that the proposed method can achieve strong performance while adhering to given constraints. The usage of pre-trained LMs allows our method to comprehend complicated constraints and learn safe policies without the need for ground-truth cost at any stage of training or evaluation. Extensive ablation studies are conducted to demonstrate the efficacy of each part of our method.

Paper Structure (18 sections, 9 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Related Work
Safe Reinforcement Learning
Reinforcement Learning with Natural Language
Pre-Trained Language Models
Preliminaries
Methodology
Cost Prediction Module
Policy Training
Experiment
Experiment Settings
General Results
Ablation Study
Ablation on Contrastive Loss
Ablation on Encoder LM
...and 3 more sections

Figures (5)

Figure 1: Cost prediction in the proposed method. The decoder LM condenses the semantic meaning of the constraint to eliminate ambiguity and redundancy. The encoder LM encodes the condensed constraints and text-based observations into embeddings according to their semantic meaning. If cosine similarity between the embeddings is greater than threshold $T$, the model will predict the constraint is violated (predicted cost $\hat{c}=1$), otherwise $\hat{c}=0$. Embedding $h_c$ is also used later as input to condition the policy network.
Figure 2: (a) One layout in Hazard-World-Grid, where orange tiles are lava, blue tiles are water and green tiles are grass. (c) The robot navigation task SafetyGoal, built in Safety-Gymnasium, where there are multiple types of objects in the environment. In both environments (a) and (c), agents have to reach goals while avoiding some type of terrain or objects specified in the natural language constraints. (b) Constraint examples for the two environments in our experiments. Compared to the structured-language constraints in previous works, constraints in our experiments are much more free-form and less intuitive.
Figure 3: Experiment results on Hazard-World-Grid and SafetyGoal. random and longpath are two layouts in Hazard-World-Grid. Easy, Medium and Hard are three levels in SafetyGoal. There are two objects of each hazard type in Easy, four in Medium and six in Hard. Thus, in total, there are 8 hazard objects in Easy, 16 in Medium and 24 in Hard. PPO-Lag refers to the baseline method PPO-Lagrangian, and PPO-CP refers to the proposed method, where the suffix CP means cost prediction. From the results, the proposed method successfully learns a safer policy with the predicted cost. CPO is proposed for robot locomotion tasks. Thus, we do not include it in the first two tasks from Hazard-World-Grid. It is worth noting that in some tasks, the proposed method learns safer policies than the baselines using ground-truth costs. This is because the cost prediction module may mistakenly think of some safe but risky actions as violating the constraints, resulting in a more conservative policy.
Figure 4: (a) gives the result of episode reward and (b) gives the result of episode cost as training goes on. The labels are in the format of 'decoder model + encoder model + cost type'. For example, 'GPT + BERT + Predicted Cost' stands for using GPT as decoder LM, BERT as encoder LM and predicted cost from our cost prediction module for policy learning. 'GPT + BERT + Predicted Cost' is our proposed method PPO-CP, and 'GPT + BERT + Ground-truth Cost' is PPO-Lag in Fig. 3. (c) gives the cost prediction results when using BERT and LSTM as the encoder model. The cost prediction with LSTM as encoder has very poor performance, which leads to the collapsed training of 'GPT + LSTM + Predicted Cost' in (a) and (b).
Figure 5: Experiment results for ablating the decoder LM and directly prompting GPT for cost prediction. w decoder LM is our proposed method, w/o decoder LM removes the decoder LM and keeps the other modules such as cost prediction, Direct Prompting GPT-3 removes our cost prediction module and queries GPT-3 with the constraint and text-based observation for cost prediction, and Direct Prompting GPT-4 replaces GPT-3 with GPT-4 in Direct Prompting GPT-3. (a) gives the results of episode reward, (b) gives the results of episode cost and (c) gives the cost prediction results. The proposed method achieves the best performance on all three metrics, while directly prompting GPT-3 performs the worst.
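
The cosine-threshold rule described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the embeddings are hand-written toy vectors standing in for encoder LM outputs, the name `h_o` for the observation embedding and the threshold value are assumptions, and only `h_c`, the threshold $T$ and the predicted cost $\hat{c}$ come from the caption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_cost(h_c, h_o, threshold):
    """Return predicted cost 1 if similarity exceeds the threshold T, else 0.

    h_c: embedding of the condensed constraint.
    h_o: embedding of the text-based observation (name is an assumption).
    """
    return 1 if cosine_similarity(h_c, h_o) > threshold else 0


# Toy vectors; in the method these would come from the encoder LM.
h_c = [1.0, 0.0]
violating_obs = [0.9, 0.1]   # points in nearly the same direction as h_c
safe_obs = [0.0, 1.0]        # orthogonal to h_c
cost_violating = predict_cost(h_c, violating_obs, threshold=0.5)
cost_safe = predict_cost(h_c, safe_obs, threshold=0.5)
```

Because the rule is a hard threshold on similarity, the choice of $T$ trades missed violations against false alarms; the Figure 3 caption notes that false alarms on risky but safe actions make the learned policy more conservative.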
