Learning with Expert Abstractions for Efficient Multi-Task Continuous Control

Jeff Jewett, Sandhya Saisubramanian

TL;DR

This paper tackles the challenge of sample-inefficient learning in continuous, multi-task control with sparse rewards by leveraging expert-defined abstractions as high-level models. It introduces Goal-Conditioned Reward Shaping (GCRS), which plans over an abstract MDP to generate subgoals and uses plan-based potential shaping $\Phi^\tau(x)=V_{H^\tau}^{*}(\phi(x))$ to guide a single goal-conditioned controller $\pi(x,s_{next},\tau)$. The framework enables efficient learning, strong generalization, and zero-shot transfer across procedurally generated tasks, outperforming existing HRL and GCRL baselines in sample efficiency, scalability, and generalization. The empirical results on CocoGrid demonstrate the practical impact of incorporating expert abstractions, with implications for planning-informed policy learning in real-world continuous control problems.

Abstract

Decision-making in complex, continuous multi-task environments is often hindered by the difficulty of obtaining accurate models for planning and the inefficiency of learning purely from trial and error. While precise environment dynamics may be hard to specify, human experts can often provide high-fidelity abstractions that capture the essential high-level structure of a task and user preferences in the target environment. Existing hierarchical approaches often target discrete settings and do not generalize across tasks. We propose a hierarchical reinforcement learning approach that addresses these limitations by dynamically planning over the expert-specified abstraction to generate subgoals to learn a goal-conditioned policy. To overcome the challenges of learning under sparse rewards, we shape the reward based on the optimal state value in the abstract model. This structured decision-making process enhances sample efficiency and facilitates zero-shot generalization. Our empirical evaluation on a suite of procedurally generated continuous control environments demonstrates that our approach outperforms existing hierarchical reinforcement learning methods in terms of sample efficiency, task completion rate, scalability to complex tasks, and generalization to novel scenarios.
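The shaping idea stated above, a potential $\Phi^\tau(x)=V_{H^\tau}^{*}(\phi(x))$ taken from the optimal value of the abstract model, can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three-state abstract chain, the toy mapping `phi`, the two discount factors, and the use of the standard potential-based form $r + \gamma\Phi(x') - \Phi(x)$ as the shaped reward.

```python
import numpy as np

GAMMA_ABSTRACT = 0.9   # assumed discount for planning in the abstract model
GAMMA = 0.99           # assumed discount of the low-level learner


def value_iteration(P, R, gamma, tol=1e-10):
    """Optimal state values of a tabular MDP.

    P[a, s, t] is the probability of moving s -> t under action a,
    R[a, s, t] is the reward collected on that transition.
    """
    V = np.zeros(P.shape[1])
    while True:
        # Q[a, s] = sum_t P[a, s, t] * (R[a, s, t] + gamma * V[t])
        Q = np.einsum("ast,ast->as", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


# Hypothetical abstract chain: 0 -> 1 -> 2, where state 2 is an absorbing goal.
P = np.zeros((2, 3, 3))
P[0] = np.eye(3)                              # action 0: stay in place
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0    # action 1: advance one state
R = np.zeros((2, 3, 3))
R[1, 1, 2] = 1.0                              # sparse reward on entering the goal

V_STAR = value_iteration(P, R, GAMMA_ABSTRACT)


def phi(x):
    """Toy state mapping: a position in [0, 3) to an abstract state index."""
    return int(np.clip(np.floor(x), 0, 2))


def potential(x):
    """Phi(x) = V*(phi(x)): optimal abstract value of the current state."""
    return V_STAR[phi(x)]


def shaped_reward(x, x_next, env_reward):
    """Environment reward plus the potential-based shaping term."""
    return env_reward + GAMMA * potential(x_next) - potential(x)
```

With these toy numbers, a step that moves from abstract state 0 to abstract state 1 receives a small positive bonus, while a step that stays in state 0 or moves backward receives a negative one, so the learner gets a dense signal even when `env_reward` is zero.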

Paper Structure

This paper contains 16 sections, 1 theorem, 4 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Let $\pi' : \mathcal{X} \times S \times \mathcal{T} \to \Delta(U)$ be the policy learned by Algorithm 1 as $N_\mathrm{steps} \to \infty$. Suppose the policy-update line of Algorithm 1 is an RL procedure that converges to an optimal policy under its usual assumptions. Then $\pi(x,\ta

Figures (6)

  • Figure 1: Overview of our proposed solution approach. An expert encodes their domain knowledge as an abstract model $\mathcal{H}$ with a mapping $\phi$ from continuous to abstract states. At each step for a given task $\tau$, map the current state $x$ to abstract state $\phi(x)$. Plan the best path with $\mathcal{H}^\tau$ and feed $s_\mathrm{next}$ into the policy to get next state $x'$. The agent is rewarded for reaching higher value abstract states.
  • Figure 2: Visualization of the "grid" and "room" abstractions. In the continuous state $x$, the red agent has grabbed a yellow key. In the grid abstraction, the agent is on a grid cell with the key. In the room abstraction, two rooms are separated by a yellow door. The agent and key are near each other (dotted circle) but not near the door.
  • Figure 3: Visualizing trajectories on DoorKey-8x8. GCRS and Plan-RS with $\mathcal{H}_\mathrm{grid}$ were successful.
  • Figure 4: Average success rates and standard deviation of different techniques, over 5 training runs.
  • Figure 5: Effect of scaling environment difficulty level on task completion rate. For U-Maze and ObjectDelivery, scale multiplies the physical size of the arena. DoorKey and LavaCrossing are scaled against the width of the grid and the number of lava walls respectively.
  • ...and 1 more figure
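The per-step loop described in the Figure 1 caption (map $x$ to $\phi(x)$, plan in the abstract model, pass $s_\mathrm{next}$ to the policy) can be sketched as follows. This is a simplified illustration, not the paper's planner: it swaps in an unweighted breadth-first search over a hypothetical abstract graph in place of planning over $\mathcal{H}^\tau$, and the state names in `ABSTRACT_GRAPH` are invented for a DoorKey-like task.

```python
from collections import deque

# Hypothetical abstract transition graph for a DoorKey-like task.
ABSTRACT_GRAPH = {
    "start": ["has_key"],
    "has_key": ["door_open", "start"],
    "door_open": ["goal"],
    "goal": [],
}


def plan_path(graph, start, goal):
    """Breadth-first search; returns the abstract states from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while s is not None:
                path.append(s)
                s = parents[s]
            return path[::-1]
        for nxt in graph.get(s, ()):
            if nxt not in parents:
                parents[nxt] = s
                queue.append(nxt)
    return None


def next_subgoal(graph, abstract_state, goal):
    """Subgoal s_next handed to the goal-conditioned policy at this step."""
    path = plan_path(graph, abstract_state, goal)
    if path is None:
        return None          # the goal is unreachable in the abstract model
    return path[1] if len(path) > 1 else abstract_state
```

At each environment step, a controller would call `next_subgoal(ABSTRACT_GRAPH, phi(x), task_goal)` with its own state mapping `phi` and condition the policy on the result. Because the subgoal is recomputed from the current abstract state, the plan adapts if the agent drifts off the intended path.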

Theorems & Definitions (2)

  • Proposition 1: Optimality
  • proof