
Learning Constraint Network from Demonstrations via Positive-Unlabeled Learning with Memory Replay

Baiyu Peng, Aude Billard

TL;DR

This work addresses inferring unknown, potentially nonlinear planning constraints from expert demonstrations by formulating constraint learning as Positive-Unlabeled (PU) learning under the SCAR assumption. Demonstrations are treated as positives while high-reward trajectories from the learner are unlabeled, enabling recovery of a constraint function $\zeta_\theta(s)$ that induces a feasible set $\mathcal{C}_\theta=\{s:\zeta_\theta(s)\le d\}$ via a postprocessing threshold $d=0.5f$. The method alternates constrained RL (PPO-penalty) with PU-based constraint inference and introduces Constraint Memory Replay to prevent forgetting of previously learned infeasible regions. Empirical results across three MuJoCo tasks demonstrate accurate recovery of continuous nonlinear constraints and superior constraint accuracy and safety compared with the maximum-entropy baseline MECL, with notable gains from memory replay. This approach enables safer, constraint-aware planning in real-world robotics where true constraint models are difficult to specify.

Abstract

Planning for a wide range of real-world tasks requires knowing and specifying all constraints. However, instances exist where these constraints are either unknown or challenging to specify accurately. A possible solution is to infer the unknown constraints from expert demonstrations. The majority of prior works limit themselves to learning simple linear constraints, or require strong knowledge of the true constraint parameterization or environmental model. To mitigate these problems, this paper presents a positive-unlabeled (PU) learning approach to infer a continuous, arbitrary, and possibly nonlinear constraint from demonstrations. From a PU learning view, we treat all data in demonstrations as positive (feasible) data, and learn a (sub)-optimal policy to generate high-reward but potentially infeasible trajectories, which serve as unlabeled data containing both feasible and infeasible states. Under an assumption on data distribution, a feasible-infeasible classifier (i.e., constraint model) is learned from the two datasets through a postprocessing PU learning technique. The entire method employs an iterative framework alternating between updating the policy, which generates and selects higher-reward policies, and updating the constraint model. Additionally, a memory buffer is introduced to record and reuse samples from previous iterations to prevent forgetting. The effectiveness of the proposed method is validated in two MuJoCo environments, successfully inferring continuous nonlinear constraints and outperforming a baseline method in terms of constraint accuracy and policy safety.
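
To make the PU step more concrete, the following is a minimal sketch of one common postprocessing recipe under the SCAR assumption: train a classifier to separate labeled (demonstration) states from unlabeled (policy) states, estimate the label frequency from its output on the positives, and rescale before thresholding. The 1-D toy data, the logistic-regression model, the helper names (`fit_logistic`, `labeled_prob`, `is_feasible`), and the sign convention (larger score means more likely feasible) are all assumptions made for illustration; the paper's constraint network and exact threshold may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D toy problem: states with s < 0 are feasible.
demo_states = rng.uniform(-2.0, 0.0, size=(500, 1))    # positives (demonstrations)
policy_states = rng.uniform(-2.0, 2.0, size=(500, 1))  # unlabeled (policy rollouts)

def add_bias(X):
    """Append a constant column so the model has an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])

def fit_logistic(X, y, lr=0.5, steps=3000):
    """Logistic regression fitted by plain gradient descent."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def labeled_prob(w, X):
    """Estimated probability that a state is labeled (came from demonstrations)."""
    return 1.0 / (1.0 + np.exp(-add_bias(X) @ w))

# Step 1: "non-traditional" classifier, labeled (1) versus unlabeled (0).
X = np.vstack([demo_states, policy_states])
y = np.concatenate([np.ones(len(demo_states)), np.zeros(len(policy_states))])
w = fit_logistic(X, y)

# Step 2: estimate the label frequency c as the mean score on positives.
# (Simplification: a held-out positive set would normally be used here.)
c_hat = labeled_prob(w, demo_states).mean()

# Step 3: postprocess. Under SCAR, P(feasible | s) is proportional to
# P(labeled | s), so rescale by c_hat and threshold at 0.5.
def is_feasible(states):
    return labeled_prob(w, states) / c_hat >= 0.5

print(is_feasible(np.array([[-1.5], [1.5]])))
```

In this sketch, thresholding the rescaled score at 0.5 is equivalent to thresholding the raw classifier output at `0.5 * c_hat`, which is why the decision threshold depends on the estimated label frequency rather than being fixed in advance.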

Paper Structure (13 sections, 10 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: The framework of the proposed method. It alternates between two steps: learning a policy using constrained RL, and learning the constraint from the policy and demonstrations with positive-unlabeled learning.
  • Figure 2: Illustration of constraint inference from the PU learning perspective and an example of the SCAR assumption. The red dots denote states from the demonstration, the blue dots denote states from the high-reward trajectory, the red lines denote the unknown constraint boundary, and the black circle denotes the target trajectory given by the reward function.
  • Figure 3: Illustration of the "constraint forgetting" problem, where constrained areas already learned in previous iterations may be forgotten later. The trajectories of the demonstrations and the policy are shown in blue and black, respectively. The red rectangle represents the constrained areas to be inferred and the black rectangle represents a known obstacle.
  • Figure 4: Visualization of constraint learning in the Point-Circle environment. (a) True constraint and demonstrations (green dots). (b) Learned constraint and policy (red dots). The x-axis and y-axis are the coordinates of the point robot. The colormap visualizes the true constraint function $\zeta^*(x,y)$, where the red area is the true or learned infeasible area, while the (light) blue area is feasible. The yellow points correspond to data stored in the memory buffer.
  • Figure 5: Learning curves of the IoU index (higher is better) and the constraint violation rate (lower is better) in three environments. The x-axis in all plots corresponds to the number of training timesteps.
  • ...and 4 more figures
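
As a rough illustration of the IoU index reported in Figure 5, the sketch below computes intersection-over-union between a true and a learned infeasible region on a discretized 2-D state grid. The grid, the two half-plane regions, and the function name `region_iou` are hypothetical, and whether the paper evaluates IoU on the infeasible or the feasible region is an assumption here.

```python
import numpy as np

def region_iou(true_mask, learned_mask):
    """Intersection-over-union of two boolean masks over the same grid."""
    intersection = np.logical_and(true_mask, learned_mask).sum()
    union = np.logical_or(true_mask, learned_mask).sum()
    # Two empty regions are treated as a perfect match.
    return intersection / union if union > 0 else 1.0

# Hypothetical integer grid of (x, y) states.
xs, ys = np.meshgrid(np.arange(-10, 11), np.arange(-10, 11))

true_infeasible = xs > 5     # assumed ground-truth constraint region
learned_infeasible = xs > 3  # assumed (slightly conservative) learned region

iou = region_iou(true_infeasible, learned_infeasible)
print(iou)
```

A learned region that over-covers the true one, as in this example, lowers the IoU even though it would not by itself raise the violation rate, which is one reason the two metrics are reported side by side.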