Going Beyond Expert Performance via Deep Implicit Imitation Reinforcement Learning

Iason Chrysomallis, Georgios Chalkiadakis

TL;DR

This work tackles imitation learning from observation-only, potentially suboptimal expert data by proposing a deep implicit imitation RL framework that blends expert guidance with independent environmental interaction. The core method, DIIQN, infers expert actions online, samples from expert observations, and uses a dynamic confidence mechanism to balance expert- and self-guided learning, enabling performance beyond the observed demonstrations. An extension, HA-DIIQN, addresses heterogeneous action spaces by identifying infeasible transitions and discovering feasible bridges, allowing knowledge transfer across agents with different capabilities. Experiments show up to 130% higher episodic returns than a standard DQN baseline and up to 64% faster learning than baselines in heterogeneous settings, with robust performance across dataset sizes and hyperparameter configurations, highlighting the practical potential of suboptimal, observation-only data in real-world RL tasks.

Abstract

Imitation learning traditionally requires complete state-action demonstrations from optimal or near-optimal experts. These requirements severely limit practical applicability, as many real-world scenarios provide only state observations without corresponding actions and expert performance is often suboptimal. In this paper we introduce a deep implicit imitation reinforcement learning framework that addresses both limitations by combining deep reinforcement learning with implicit imitation learning from observation-only datasets. Our main algorithm, Deep Implicit Imitation Q-Network (DIIQN), employs an action inference mechanism that reconstructs expert actions through online exploration and integrates a dynamic confidence mechanism that adaptively balances expert-guided and self-directed learning. This enables the agent to leverage expert guidance for accelerated training while maintaining capacity to surpass suboptimal expert performance. We further extend our framework with a Heterogeneous Actions DIIQN (HA-DIIQN) algorithm to tackle scenarios where expert and agent possess different action sets, a challenge previously unaddressed in the implicit imitation learning literature. HA-DIIQN introduces an infeasibility detection mechanism and a bridging procedure identifying alternative pathways connecting agent capabilities to expert guidance when direct action replication is impossible. Our experimental results demonstrate that DIIQN achieves up to 130% higher episodic returns compared to standard DQN, while consistently outperforming existing implicit imitation methods that cannot exceed expert performance. In heterogeneous action settings, HA-DIIQN learns up to 64% faster than baselines, leveraging expert datasets unusable by conventional approaches. Extensive parameter sensitivity analysis reveals the framework's robustness across varying dataset sizes and hyperparameter configurations.

Paper Structure

This paper contains 36 sections, 17 equations, 12 figures, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of action space heterogeneity. (a) Homogeneous scenario where the expert and agent share the same orthogonal actions. (b) Heterogeneous scenario where the agent's action set consists of diagonal actions instead.
  • Figure 2: Starting from a similar initial state between the agent and the expert ($s_a \approx s_e$), the trajectories diverge as each takes a different action. The agent executes action $a_a$, leading to the transition $s_a \rightarrow s_a'$, while the expert performs action $a_e$, resulting in the transition $s_e \rightarrow s_e'$.
  • Figure 3: Expert sampling procedure with KNN search and similarity filtering. The expert dataset (outer region) contains all available expert transitions. A KNN search identifies the $k$ nearest expert states to the current agent state (middle region). From these candidates, only those within the similarity threshold $\tau_{similar}$ are retained as satisfactory candidates (inner region). The arrow illustrates the maximum acceptable distance $\tau_{similar}$ between states for them to be considered similar. One expert transition is randomly selected from the satisfactory candidates to provide guidance for the current training step. If no candidates satisfy the similarity threshold, expert guidance is skipped for that step.
  • Figure 4: Illustration of bridge discovery in heterogeneous action spaces. Starting from similar states $s_a \approx s_e$, the expert executes infeasible action $a_{infeas}$ to transition to $s_e'$, which the agent cannot replicate. Instead, the agent discovers a feasible bridge path via $s_a \xrightarrow{a_{feas}} s_1 = s_{feas} \rightarrow s_2$ that intersects with the expert's downstream trajectory at state $s_e''$. This bridge enables the agent to align with the expert's strategic guidance despite the action space heterogeneity.
  • Figure 5: Cross-referencing mechanism for trajectory expansion. Within a single dataset, similar states occurring at different positions enable trajectory recombination. Original path: Following the natural sequential order (e.g., $s_1 \rightarrow s_2 \rightarrow s_3$ or $s_1^{\#} \rightarrow s_2 \rightarrow s_3^{\#}$). Newly discovered path: By switching at similar states, alternative trajectories are created by combining segments from different parts of the dataset ($s_1 \rightarrow s_2 \rightarrow s_3^{\#}$). This cross-referencing mechanism expands the set of available paths beyond the original sequences, providing additional options for bridge discovery in heterogeneous action settings.
  • ...and 7 more figures
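
The expert sampling procedure described in the Figure 3 caption (KNN search, similarity filtering by $\tau_{similar}$, random selection, skip when nothing qualifies) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_expert_transition`, the use of Euclidean distance, and the brute-force neighbor search are all assumptions made for this example, since the caption does not specify a metric or a search structure.

```python
import numpy as np


def sample_expert_transition(agent_state, expert_states, expert_next_states,
                             k, tau_similar, rng):
    """Pick one expert transition whose start state is similar to agent_state.

    Returns (index, s_e, s_e_next), or None when no expert state lies within
    tau_similar, in which case expert guidance is skipped for this step.
    Hypothetical helper; Euclidean distance is an assumption.
    """
    if len(expert_states) == 0:
        return None

    # Distance from the agent state to every expert state (brute force).
    dists = np.linalg.norm(expert_states - agent_state, axis=1)

    # KNN search: indices of the k nearest expert states (unordered).
    k = min(k, len(dists))
    nearest = np.argpartition(dists, k - 1)[:k]

    # Similarity filter: keep only candidates within tau_similar.
    candidates = nearest[dists[nearest] <= tau_similar]
    if candidates.size == 0:
        return None

    # Uniformly random choice among the satisfactory candidates.
    idx = int(rng.choice(candidates))
    return idx, expert_states[idx], expert_next_states[idx]


# Toy observation-only dataset: states and their successor states, no actions.
expert_states = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.0]])
expert_next_states = expert_states + np.array([1.0, 0.0])
rng = np.random.default_rng(0)

result = sample_expert_transition(
    np.array([0.0, 0.0]), expert_states, expert_next_states,
    k=2, tau_similar=0.5, rng=rng,
)
```

In this toy call the two nearest expert states (indices 0 and 3) both pass the threshold, so one of them is returned at random; an agent state far from every expert state yields `None`. For large datasets a spatial index would likely replace the brute-force distance computation.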