Learning from Imperfect Demonstrations via Temporal Behavior Tree-Guided Trajectory Repair

Aniruddh G. Puranic, Sebastian Schirmer, John S. Baras, Calin Belta

Abstract

Learning robot control policies from demonstrations is a powerful paradigm, yet real-world data is often suboptimal, noisy, or otherwise imperfect, posing significant challenges for imitation and reinforcement learning. In this work, we present a formal framework that leverages Temporal Behavior Trees (TBT), an extension of Signal Temporal Logic (STL) with Behavior Tree semantics, to repair suboptimal trajectories prior to their use in downstream policy learning. Given demonstrations that violate a TBT specification, a model-based repair algorithm corrects trajectory segments to satisfy the formal constraints, yielding a dataset that is both logically consistent and interpretable. The repaired trajectories are then used to extract potential functions that shape the reward signal for reinforcement learning, guiding the agent toward task-consistent regions of the state space without requiring knowledge of the agent's kinematic model. We demonstrate the effectiveness of this framework on discrete grid-world navigation and continuous single and multi-agent reach-avoid tasks, highlighting its potential for data-efficient robot learning in settings where high-quality demonstrations cannot be assumed.
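
To make the reward-shaping step concrete, the following is a minimal sketch of standard potential-based shaping (Ng et al., 1999) as described in the abstract. The names `phi`, `GOAL`, and `shaped_reward` are illustrative, and the distance-to-goal potential is a stand-in assumption; in the paper's pipeline the potential would instead be extracted from the TBT-repaired demonstrations.

```python
import numpy as np

# Stand-in potential: in the proposed framework this would be extracted
# from the repaired demonstrations; here it is simply the negated
# distance to a (hypothetical) goal location.
GOAL = np.array([1.0, 1.0])

def phi(state: np.ndarray) -> float:
    return -float(np.linalg.norm(state - GOAL))

def shaped_reward(r_env: float, s: np.ndarray, s_next: np.ndarray,
                  gamma: float = 0.99) -> float:
    # Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s).
    # Shaping of this form preserves optimal policies (Ng et al., 1999),
    # so the demonstrations guide exploration without altering the task.
    return r_env + gamma * phi(s_next) - phi(s)
```

Because the shaping term telescopes along any trajectory, it biases learning toward demonstration-consistent regions of the state space while leaving the optimal policy of the underlying MDP unchanged.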

Paper Structure

This paper contains 14 sections, 11 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Overview of the proposed framework. The TBT specification $\mathcal{T}$ governs both the repair of suboptimal demonstrations and the construction of the TBT monitor and reward potentials used during RL training.
  • Figure 2: Grid-world environment: (a) Blue indicates the start location at the top-left, red is the goal at the bottom-right, and the black cells are obstacles; (b) Example demonstrations that violate the TBT specification by visiting an unsafe cell; (c) Heatmaps of the potential functions extracted from the cleaned trajectories (one plausible construction is sketched after this list); (d) Comparison of the mean episodic rewards and success rates.
  • Figure 3: Reach-avoid navigation task: (a) A close-up of our experiment setup showing a mobile robot (indicated by a car-like object) tasked with navigating to a uniformly randomized 2D goal location (green cylinder); (b) Multi-agent extension of the single-agent environment. The robots are indicated by red and blue car-like objects and the goals by red and blue cylinders; (c)–(e) Each robot must avoid two types of obstacles: traversable soft constraints (purple discs) and hard constraints that inhibit motion upon collision (cyan cubes). These hazards incur costs that the agent must minimize while learning to safely navigate to the goal.
  • Figure 4: Comparisons between the proposed framework and RL baselines in the Safety-Gymnasium reach-avoid tasks.
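
The potential heatmaps in Figure 2(c) invite a concrete construction. The paper's estimator is not detailed on this page, so the sketch below is only one plausible scheme, assuming a grid world and scoring each cell by the negated mean number of steps remaining to the end of the repaired trajectories that visit it; `potential_from_demos` is a hypothetical name.

```python
import numpy as np

def potential_from_demos(demos, grid_shape):
    """Illustrative potential over a grid from repaired demonstrations.

    Assigns each visited cell the negated mean number of remaining steps
    to the end of the trajectory, a crude progress-to-goal measure.
    `demos` is a list of trajectories, each a list of (row, col) cells.
    This is an assumption for illustration, not the paper's estimator.
    """
    totals = np.zeros(grid_shape)
    counts = np.zeros(grid_shape)
    for traj in demos:
        T = len(traj)
        for t, (r, c) in enumerate(traj):
            totals[r, c] += (T - 1 - t)  # steps remaining until the goal
            counts[r, c] += 1
    steps_to_go = np.divide(totals, counts,
                            out=np.full(grid_shape, np.inf),
                            where=counts > 0)
    return -steps_to_go  # higher potential closer to the goal
```

Unvisited cells receive a potential of negative infinity under this scheme; in practice one would smooth or lower-bound the estimate before using it for reward shaping.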

Theorems & Definitions (5)

  • Definition 3.1: Markov Decision Process
  • Definition 3.2: Trajectory or Episode Rollout
  • Definition 3.3: Syntax of Temporal Behavior Trees [schirmer2024_tbt]
  • Definition 3.4: Robustness Semantics of TBTs
  • Definition 3.5: STL Robust Semantics (a worked instance is sketched below)
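
As a concrete instance of Definition 3.5, the sketch below evaluates the standard STL quantitative semantics for a reach-avoid formula, F(||x - g|| < r_g) AND G(||x - o|| > r_o), over a sampled trajectory. The radii, regions, and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def reach_avoid_robustness(traj, goal, obstacle, r_goal=0.3, r_obs=0.5):
    """Robustness of  F(||x - goal|| < r_goal)  AND  G(||x - obs|| > r_obs).

    Follows the standard STL quantitative semantics: `eventually` is a
    max over time, `always` is a min over time, and conjunction is a min.
    `traj` is a (T, 2) array of positions; positive return values mean
    the trajectory satisfies the specification with that margin.
    """
    d_goal = np.linalg.norm(traj - goal, axis=1)
    d_obs = np.linalg.norm(traj - obstacle, axis=1)
    rho_reach = np.max(r_goal - d_goal)  # eventually inside the goal region
    rho_avoid = np.min(d_obs - r_obs)    # always outside the obstacle
    return min(rho_reach, rho_avoid)
```

For a trajectory that enters the goal region while staying clear of the obstacle, both terms are positive and the minimum gives the satisfaction margin; any violation drives the value negative, which is the signal the repair algorithm aims to eliminate.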