Positive-Unlabeled Constraint Learning for Inferring Nonlinear Continuous Constraints Functions from Expert Demonstrations

Baiyu Peng; Aude Billard

TL;DR

The paper addresses the problem of inferring unknown nonlinear continuous constraints from expert demonstrations by formulating constraint learning as a two-step Positive-Unlabeled learning problem (PUCL). The method alternates policy updates with constraint inference: it first identifies reliable infeasible data via a distance-based score, then trains a constraint classifier from the feasible demonstrations and the reliable infeasible data. It yields flexible constraint boundaries without requiring an explicit constraint parameterization or an environmental model, and it outperforms baselines in multiple constrained settings in terms of policy safety. It also accommodates two policy representations, constrained RL and dynamical system modulation (DSM), and demonstrates transfer to variants of the same task, suggesting practical value for robotics applications with implicit user preferences.

Abstract

Planning for diverse real-world robotic tasks requires knowing and specifying all constraints. In some cases, however, these constraints are unknown or difficult to specify accurately. One possible solution is to infer the unknown constraints from expert demonstrations. This paper presents a novel two-step Positive-Unlabeled Constraint Learning (PUCL) algorithm that infers a continuous constraint function from demonstrations without requiring prior knowledge of the true constraint parameterization or an environmental model, unlike existing works. We treat all data in the demonstrations as positive (feasible) data and learn a control policy to generate potentially infeasible trajectories, which serve as unlabeled data. The proposed two-step learning framework first identifies reliable infeasible data using a distance metric and then learns a binary feasibility classifier (i.e., the constraint function) from the feasible demonstrations and the reliable infeasible data. The method can flexibly learn constraint boundaries of complex shape and, unlike previous methods, does not mistakenly classify demonstrations as infeasible. Its effectiveness is verified in four constrained environments, using either a neural network policy or a dynamical system policy. It successfully infers the continuous nonlinear constraints and outperforms other baseline methods in terms of constraint accuracy and policy safety. This work has been published in IEEE Robotics and Automation Letters (RA-L). Please refer to the final version at https://doi.org/10.1109/LRA.2024.3522756
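
The two-step idea in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the function names (reliable_infeasible, fit_feasibility_classifier) are hypothetical, the distance rule is taken to be "nearest demonstration state farther than a threshold d_r", the value d_r = 0.55 is arbitrary, and a logistic regression on RBF features stands in for the constraint network.

```python
import numpy as np

def reliable_infeasible(demos, unlabeled, d_r):
    """Step 1 (assumed rule): keep unlabeled states whose nearest
    demonstration state is farther away than the threshold d_r."""
    # Pairwise Euclidean distances, shape (n_unlabeled, n_demos).
    dists = np.linalg.norm(unlabeled[:, None, :] - demos[None, :, :], axis=-1)
    return unlabeled[dists.min(axis=1) > d_r]

def rbf_features(points, centers, gamma):
    """Gaussian RBF features of each point with respect to each center."""
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fit_feasibility_classifier(feasible, infeasible, gamma=2.0, lr=0.5, steps=2000):
    """Step 2 (stand-in model): logistic regression on RBF features,
    trained with the binary cross-entropy loss. Label 1 = feasible, 0 = infeasible."""
    centers = np.vstack([feasible, infeasible])
    labels = np.concatenate([np.ones(len(feasible)), np.zeros(len(infeasible))])
    phi = rbf_features(centers, centers, gamma)
    w, b = np.zeros(len(centers)), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(phi @ w + b)))
        grad = probs - labels              # gradient of the loss w.r.t. the logits
        w -= lr * phi.T @ grad / len(labels)
        b -= lr * grad.mean()

    def zeta(query):
        """Estimated feasibility in (0, 1) for one state or a batch of states."""
        feats = rbf_features(np.atleast_2d(query), centers, gamma)
        return 1.0 / (1.0 + np.exp(-(feats @ w + b)))

    return zeta

# Toy data (hypothetical): demonstrations go around the origin on an arc,
# while the current policy cuts straight through it.
angles = np.linspace(0.0, np.pi, 21)
demos = 1.5 * np.column_stack([np.cos(angles), np.sin(angles)])
policy_states = np.column_stack([np.linspace(-1.5, 1.5, 31), np.zeros(31)])

bad = reliable_infeasible(demos, policy_states, d_r=0.55)
zeta = fit_feasibility_classifier(demos, bad)
print(len(bad), zeta([0.0, 0.0])[0] < 0.5, zeta([0.0, 1.5])[0] > 0.5)
```

Because the demonstrations always enter the classifier with the feasible label, a classifier that fits its training data does not mark them as infeasible, which is the property the abstract highlights. In the full algorithm the policy would then be retrained against the learned constraint and the two steps repeated.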

Paper Structure (14 sections, 9 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Preliminaries and Problem Statements
Preliminaries and Notations
Problem Formulation
Method
Motivation: Constraint Inference as Positive-Unlabeled Learning
Method
Iterative Learning Framework with Policy Filter and Memory Buffer
Represent and Learn Policy via Constrained RL or Dynamical System Modulation
Experiment
Comparison of Constraint Learning Methods
Comparison of Policy Representation and Learning Methods
Constraint Transfer to Variant of the Same Task
Conclusions and Limitations

Figures (6)

Figure 1: Learning an obstacle-avoidance constraint from demonstrations with the proposed method. The task requires the robot to reach a target state while avoiding going close to or over the cups. These unknown constraints can be translated into an infeasible region that must be inferred from demonstrations. The top row illustrates one obstacle configuration, while the bottom row shows another. In each row, the left image displays the expert demonstrations (in green), the middle image shows the learned constraint region (in red), and the right image illustrates the learned policy, which is trained with the learned constraint and tested on a set of shifted goal states and starting points.
Figure 2: The framework of the proposed constraint learning algorithm PUCL. It alternates between updating the policy using the current constraint network and inferring the constraint from the demonstrations and the current policy. The constraint is inferred using a novel two-step positive-unlabeled learning technique. In the first step, the trajectories $\mathcal{P}$ sampled from the current policy are viewed as unlabeled data, while the demonstrations are viewed as labeled feasible data. From these two datasets we identify reliable infeasible data using a distance-based metric. In the second step, the reliable infeasible data from the current and previous iterations, together with the feasible demonstrations, are used to train a constraint network with a standard binary classification loss.
Figure 3: Sensitivity analysis of the distance threshold $d_r$ on performance. The left y-axis represents the unsafe rate (lower is better), and the right y-axis represents the IoU (higher is better). The curves are averaged over 5 runs with different random seeds.
Figure 4: True constraint and constraint learned with the proposed method PUCL and the baseline MECL [anwar2020InverseCR] in the 2D reaching task. The two white ellipses represent the boundary of the true constraint. The heat map is a visualization of the learned constraint $\zeta_\theta$, where the red regions with $\zeta=0$ represent infeasibility, while the blue regions with $\zeta=1$ indicate feasibility. The trajectories of the demonstrations and the policy are shown in green and yellow, respectively. The red points correspond to the identified reliable infeasible data $\mathcal{R}\cup\mathcal{M}$ (PUCL only).
Figure 5: The IoU index (higher is better) and unsafe rate (lower is better) in three environments (top: 2D reaching, middle: 3D reaching, bottom: blocked half cheetah). The x-axis shows training progress, measured as the number of timesteps the agent takes in the environment. All results are averaged over 10 independent runs, and the shaded area represents one standard deviation.
...and 1 more figure
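
The two metrics named in the captions of Figures 3 and 5 can be sketched in Python as follows. The exact definitions used in the paper are not given on this page, so both are assumptions: IoU is computed here as the standard intersection over union of the true and learned infeasible regions on a set of sample points, and the unsafe rate is taken to be the fraction of trajectories that enter the true infeasible region at least once. The function names and toy data are hypothetical.

```python
import numpy as np

def iou(true_infeasible, pred_infeasible):
    """Intersection over union of two boolean masks over the same sample points."""
    inter = np.logical_and(true_infeasible, pred_infeasible).sum()
    union = np.logical_or(true_infeasible, pred_infeasible).sum()
    # Convention chosen here: two empty regions count as a perfect match.
    return float(inter / union) if union > 0 else 1.0

def unsafe_rate(trajectories, is_infeasible):
    """Assumed definition: share of trajectories with at least one infeasible state."""
    violations = [bool(np.any(is_infeasible(traj))) for traj in trajectories]
    return float(np.mean(violations))

# Toy check: the masks overlap on 1 of 3 marked points.
true_mask = np.array([True, True, False, False])
pred_mask = np.array([False, True, True, False])
iou_value = iou(true_mask, pred_mask)

# Toy constraint (hypothetical): the open unit disk is infeasible.
def in_unit_disk(states):
    return np.linalg.norm(states, axis=1) < 1.0

xs = np.linspace(-2.0, 2.0, 9)
through_obstacle = np.column_stack([xs, np.zeros_like(xs)])   # passes the origin
around_obstacle = np.column_stack([xs, np.full_like(xs, 2.0)])  # stays at y = 2
rate = unsafe_rate([through_obstacle, around_obstacle], in_unit_disk)
print(iou_value, rate)
```

In practice the masks would come from evaluating the true constraint and the thresholded learned constraint on a grid over the state space, and the trajectories from rolling out the learned policy.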

Theorems & Definitions (1)

Definition 1: Constraint Learning
