SEABO: A Simple Search-Based Method for Offline Imitation Learning

Jiafei Lyu; Xiaoteng Ma; Le Wan; Runze Liu; Xiu Li; Zongqing Lu

TL;DR

SEABO addresses reward design in offline imitation learning with a simple, unsupervised, search-based procedure for labeling unlabeled transitions. It builds a KD-tree over the expert demonstrations and assigns each unlabeled transition a reward based on its distance $d$ to its nearest expert neighbor, using the formula $r = \alpha \exp\left(-\dfrac{\beta \times d}{|\mathcal{A}|}\right)$, where $\alpha$ is the reward scale, $\beta$ is a weighting coefficient, and $|\mathcal{A}|$ is the dimension of the action space. The method is agnostic to the underlying offline RL algorithm and works with expert data that either contains actions or is state-only. It yields competitive or superior results on D4RL benchmarks, including the AntMaze and Adroit domains, even with a single expert trajectory. Because it requires no discriminator or reward-model training, SEABO is a simple, efficient, and broadly applicable tool for offline IL in robotics and control tasks.
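The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each transition is assumed to be already flattened into one numeric vector (the exact featurization is an assumption here), the distance is assumed to be Euclidean, the helper names `build_kdtree`, `nearest_distance`, and `seabo_rewards` are hypothetical, and the default values of `alpha` and `beta` are placeholders rather than the paper's settings. A tiny pure-Python KD-tree is used so the sketch has no dependencies; in practice a library KD-tree would likely be preferable.

```python
import math
from typing import List, Optional, Sequence

Point = Sequence[float]


class KDNode:
    """One node of a KD-tree: a point, its split axis, and two subtrees."""

    def __init__(self, point: Point, axis: int,
                 left: Optional["KDNode"], right: Optional["KDNode"]):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    """Recursively split the points at the median along a cycling axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(
        ordered[mid],
        axis,
        build_kdtree(ordered[:mid], depth + 1),
        build_kdtree(ordered[mid + 1:], depth + 1),
    )


def nearest_distance(node: Optional[KDNode], query: Point,
                     best: float = math.inf) -> float:
    """Return the Euclidean distance from `query` to its nearest tree point."""
    if node is None:
        return best
    best = min(best, math.dist(node.point, query))
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest_distance(near, query, best)
    # The far side can only help if the splitting plane is closer than `best`.
    if abs(diff) < best:
        best = nearest_distance(far, query, best)
    return best


def seabo_rewards(expert: List[Point], unlabeled: List[Point],
                  action_dim: int, alpha: float = 1.0,
                  beta: float = 0.5) -> List[float]:
    """Label each unlabeled vector with r = alpha * exp(-beta * d / |A|).

    `alpha` and `beta` defaults are illustrative placeholders.
    """
    tree = build_kdtree(list(expert))
    return [
        alpha * math.exp(-beta * nearest_distance(tree, q) / action_dim)
        for q in unlabeled
    ]


# Toy usage: the first query coincides with an expert point, the second is far.
expert_demo = [(0.0, 0.0), (1.0, 1.0)]
queries = [(0.0, 0.0), (3.0, 5.0)]
rewards = seabo_rewards(expert_demo, queries, action_dim=1)
```

In the toy usage, a query that coincides with an expert point has distance zero and therefore receives the maximum reward `alpha`, while the distant query receives a strictly smaller reward. The labeled transitions could then be passed to any offline RL algorithm.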

Abstract

Offline reinforcement learning (RL) has attracted much attention due to its ability to learn from static offline datasets, which eliminates the need to interact with the environment. Nevertheless, the success of offline RL relies heavily on offline transitions annotated with reward labels. In practice, we often need to hand-craft the reward function, which is sometimes difficult, labor-intensive, or inefficient. To tackle this challenge, we focus on the offline imitation learning (IL) setting and aim to obtain a reward function from the expert data and the unlabeled data. To that end, we propose a simple yet effective search-based offline IL method, tagged SEABO. SEABO assigns a larger reward to a transition that is close to its nearest neighbor in the expert demonstration, and a smaller reward otherwise, all in an unsupervised manner. Experimental results on a variety of D4RL datasets indicate that, given only a single expert trajectory, SEABO can achieve performance competitive with offline RL algorithms that use ground-truth rewards, and can outperform prior reward learning and offline IL methods across many tasks. Moreover, we demonstrate that SEABO also works well if the expert demonstrations contain only observations. Our code is publicly available at https://github.com/dmksjfl/SEABO.

Paper Structure (23 sections, 4 equations, 13 figures, 15 tables, 1 algorithm)

Introduction
Preliminary
Related Work
Offline Imitation Learning via Search-Based Method
Experiments
Main Results
Comparison Against Offline IL Algorithms
State-only Regimes
Comparison of Different Search Algorithms
Parameter Study
Conclusion
Hyperparameter Setup
Missing Experimental Results
Numerical Comparison Under Ten Expert Demonstrations
Comparison of TD3_BC+OTR and TD3_BC+SEABO
...and 8 more sections

Figures (13)

Figure 1: Left: The key idea behind SEABO. We assign larger rewards to transitions that are closer to the expert demonstration, and smaller rewards otherwise. The dotted lines connect the query samples with their nearest neighbors along the demonstration. Right: Illustration of the SEABO framework. Given an expert demonstration, we first construct a KD-tree and then feed the unlabeled samples into the tree to query their nearest neighbors. We use the resulting distance to calculate the reward label. Then one can adopt any existing offline RL algorithm to train on the labeled dataset.
Figure 2: Density plots of ground-truth rewards and rewards acquired by SEABO. Here, oracle denotes the ground-truth rewards.
Figure 3: Parameter study on the reward scale. The shaded region denotes the standard deviation.
Figure 4: Parameter study of (a) weighting coefficient $\beta$, (b) number of neighbors $N$. The shaded region captures the standard deviation.
Figure 5: Additional experiments on the influence of $\alpha$. The shaded region captures the standard deviation. All other hyperparameters are kept unchanged except $\alpha$.
...and 8 more figures
