Online Adaptation for Enhancing Imitation Learning Policies

Federico Malato, Ville Hautamaki

TL;DR

This work addresses the fragility of imitation learning when expert datasets fail to fully capture task dynamics. It introduces Bayesian online adaptation (BOA), which combines an IL policy's action distribution with expert-derived actions via a Dirichlet-Multinomial framework and a retrieval-based expert search. By updating action beliefs in real time using counts from retrieved expert experiences and sampling from the posterior, BOA enhances robustness and can rescue policies that would otherwise fail. Across ten MiniWorld tasks, BOA improves numerical rewards and provides perceptual benefits, while maintaining interpretable action selection dynamics and real-time inference via efficient search.
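The belief update described above can be illustrated with a minimal sketch of a Dirichlet-Multinomial combination. It is not the authors' implementation: the function name `adapt_action_distribution`, the `prior_strength` parameter, and the choice to use the policy's action probabilities directly as the Dirichlet prior concentration are all assumptions made for illustration. The sketch samples from the posterior predictive distribution, which for a single draw is the normalized posterior concentration.

```python
import numpy as np


def adapt_action_distribution(policy_probs, expert_actions, prior_strength=1.0, rng=None):
    """Combine a policy's action distribution with retrieved expert actions.

    Hypothetical sketch: the prior concentration is assumed to be the policy
    probabilities scaled by `prior_strength`; the paper may define it differently.
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    n_actions = policy_probs.shape[0]

    # Dirichlet prior concentration derived from the policy output (assumption).
    alpha_prior = prior_strength * policy_probs

    # Multinomial evidence: how often each action appears among retrieved expert samples.
    counts = np.bincount(np.asarray(expert_actions, dtype=int), minlength=n_actions)

    # Conjugate update: posterior concentration is prior plus counts.
    alpha_post = alpha_prior + counts

    # Posterior predictive for one action draw is the normalized concentration.
    adapted_probs = alpha_post / alpha_post.sum()

    rng = np.random.default_rng() if rng is None else rng
    action = int(rng.choice(n_actions, p=adapted_probs))
    return action, adapted_probs


# Example: the policy prefers action 0, but three of four retrieved expert frames chose action 1.
action, probs = adapt_action_distribution([0.7, 0.2, 0.1], [1, 1, 1, 2])
```

In this sketch, a larger `prior_strength` makes the adapted distribution stay closer to the policy, while more retrieved expert samples shift it toward the expert counts.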

Abstract

Imitation learning enables autonomous agents to learn from human examples, without the need for a reward signal. Still, if the provided dataset does not encapsulate the task correctly, or when the task is too complex to be modeled, such agents fail to reproduce the expert policy. We propose to recover from these failures through online adaptation. Our approach combines the action proposal coming from a pre-trained policy with relevant experience recorded by an expert. The combination results in an adapted action that closely follows the expert. Our experiments show that an adapted agent performs better than its pure imitation learning counterpart. Notably, adapted agents can achieve reasonable performance even when the base, non-adapted policy catastrophically fails.

Paper Structure (23 sections, 4 equations, 5 figures, 1 table)

This paper contains 23 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: A visual explanation of our proposed method. At timestep $t$, the current observation is fed to the imitation learning policy network to obtain a policy action distribution. Concurrently, we retrieve a number of resembling frames from the expert data and compute an expert action distribution. The two distributions are then combined to obtain a joint, adapted action distribution. Finally, the action is selected by sampling from the joint distribution.
  • Figure 2: Screenshots from the $10$ MiniWorld environments used in our experiments. MiniWorld uses minimal graphics, while still providing variance in the visual domain. From left to right, top to bottom: CollectHealth, FourRooms, Hallway, MazeS3, OneRoom, PutNext, Sidewalk, TMaze, WallGap, YMaze. Images are upscaled to $800 \times 600$ for visual clarity.
  • Figure 3: Mean success rate for different numbers of retrieved samples. Each graph corresponds to one environment. Measurements are collected over $3$ runs of $30$ episodes each. The best value is marked with a star. Red lines denote a GAIL agent adapted with BOA, and blue lines denote a BC agent adapted with BOA.
  • Figure 4: Mean success rate for different numbers of encoded trajectories. The test is conducted on both BOA+BC (blue line) and BOA+GAIL (red line) over $5$ runs of $30$ episodes each. $n$ varies between $1$ and $150$. The best value for each agent is marked with a star.
  • Figure 5: Average reward comparison over the $10$ selected tasks. Higher is better. In all environments, the reward is in the range $[0, 1]$ except for CollectHealth, where the reward is in $[-2, +\infty)$. We highlight BOA agents with striped bars and link them to the corresponding IL agent by matching the color of the bar. Grey bars represent baseline methods.
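
The retrieval step described in the Figure 1 caption (finding expert frames that resemble the current observation and turning their actions into an expert action distribution) might look roughly like the sketch below. Everything specific here is an assumption: the function name `retrieve_expert_counts`, the use of precomputed latent vectors, the L2 distance, and the brute-force search, which stands in for whatever efficient search index the paper uses.

```python
import numpy as np


def retrieve_expert_counts(query_latent, expert_latents, expert_actions, n_actions, k=5):
    """Count the actions taken in the k expert frames closest to the query.

    Hypothetical sketch: assumes observations are already encoded as latent
    vectors and that similarity is measured by Euclidean distance.
    """
    query_latent = np.asarray(query_latent, dtype=float)
    expert_latents = np.asarray(expert_latents, dtype=float)
    expert_actions = np.asarray(expert_actions, dtype=int)

    # Brute-force distance from the query to every stored expert frame.
    distances = np.linalg.norm(expert_latents - query_latent, axis=1)

    # Indices of the k most similar frames (fewer if the dataset is smaller than k).
    k = min(k, len(distances))
    nearest = np.argsort(distances)[:k]

    # Action counts among the retrieved frames form the expert evidence.
    return np.bincount(expert_actions[nearest], minlength=n_actions)


# Example: four stored expert frames in a 2-D latent space, three possible actions.
latents = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2]]
actions = [0, 1, 2, 1]
counts = retrieve_expert_counts([0.0, 0.0], latents, actions, n_actions=3, k=3)
```

Normalizing `counts` gives an empirical expert action distribution over the retrieved frames. For large expert datasets, the brute-force distance computation would likely be replaced by an approximate nearest-neighbor index to keep inference real-time.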