HYDRA: Hybrid Robot Actions for Imitation Learning

Suneel Belkhale, Yuchen Cui, Dorsa Sadigh

TL;DR

HYDRA tackles imitation-learning distribution shift by introducing a two-level action representation that combines sparse waypoints with dense low-level actions and by performing offline action relabeling to boost dataset consistency. The method uses a multi-headed architecture to predict modes, waypoints, and actions, enabling dynamic switching between coarse and fine-grained control at test time. Empirical results across seven long-horizon manipulation tasks in simulation and the real world show 30-40% improvements over strong baselines, with ablations highlighting the benefits of action relabeling and hybrid action spaces. HYDRA demonstrates robust performance in challenging, real-world robotics tasks and offers a practical approach to balancing dexterity and data efficiency in imitation learning.

Abstract

Imitation Learning (IL) is a sample-efficient paradigm for robot learning using expert demonstrations. However, policies learned through IL suffer from state distribution shift at test time, because compounding errors in action prediction lead to previously unseen states. Choosing an action representation for the policy that minimizes this distribution shift is critical in imitation learning. Prior work proposes using temporal action abstractions to reduce compounding errors, but these often sacrifice policy dexterity or require domain-specific knowledge. To address these trade-offs, we introduce HYDRA, a method that leverages a hybrid action space with two levels of action abstraction: sparse high-level waypoints and dense low-level actions. HYDRA dynamically switches between action abstractions at test time to enable both coarse and fine-grained control of a robot. In addition, HYDRA employs action relabeling to increase the consistency of actions in the dataset, further reducing distribution shift. HYDRA outperforms prior imitation learning methods by 30-40% on seven challenging simulation and real-world environments, including long-horizon real-world tasks like making coffee and toasting bread. Videos can be found on our website: https://tinyurl.com/3mc6793z
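
The test-time switching described above can be illustrated with a minimal sketch. It follows the convention from the Figure 1 caption (mode 0 servos to a waypoint without requerying the policy, mode 1 executes one dense action for one time step). Everything else is an assumption made for illustration and is not taken from the paper: the names `select_action`, `run_episode`, `toy_dense_net`, `toy_sparse_net`, `servo_to`, and `step_dense`, the toy 3-D position state, the hand-written stand-ins for the learned networks, and the idealized servo controller that reaches its target exactly.

```python
import numpy as np

GOAL = np.array([1.0, 0.0, 0.0])  # hypothetical target position for the toy task


def toy_dense_net(obs):
    """Stand-in for Dense Net: returns (a_t, P(m_t = 1)). Not a learned model."""
    delta = GOAL - obs
    a_t = np.clip(delta, -0.05, 0.05)               # small low-level step
    p_dense = 1.0 if np.linalg.norm(delta) < 0.2 else 0.0
    return a_t, p_dense


def toy_sparse_net(obs):
    """Stand-in for Sparse Net: returns a waypoint w_t just short of the goal."""
    return GOAL - np.array([0.1, 0.0, 0.0])


def servo_to(obs, waypoint):
    """Idealized controller (assumption): reaches the waypoint exactly."""
    return waypoint.copy()


def step_dense(obs, action):
    """Apply one dense action for a single time step."""
    return obs + action


def select_action(obs, dense_net, sparse_net, rng):
    """Sample the mode m_t, then pick the matching head's output."""
    a_t, p_dense = dense_net(obs)
    m_t = int(rng.random() < p_dense)
    if m_t == 0:
        return m_t, sparse_net(obs)   # sparse mode: waypoint target
    return m_t, a_t                   # dense mode: low-level action


def run_episode(obs, dense_net, sparse_net, rng, n_queries=6):
    """Query the policy n_queries times, switching modes per query."""
    modes = []
    for _ in range(n_queries):
        m_t, target = select_action(obs, dense_net, sparse_net, rng)
        if m_t == 0:
            obs = servo_to(obs, target)    # no policy requery while servoing
        else:
            obs = step_dense(obs, target)  # exactly one time step
        modes.append(m_t)
    return obs, modes


rng = np.random.default_rng(0)
final_obs, modes = run_episode(np.zeros(3), toy_dense_net, toy_sparse_net, rng)
```

In this toy run the policy first issues a single waypoint to cover most of the distance and then falls back to dense actions near the goal, which mirrors the coarse-then-fine behavior the abstract describes. A real system would replace the stand-ins with trained networks and a robot controller.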

Paper Structure (28 sections, 4 equations, 7 figures, 7 tables, 3 algorithms)


Figures (7)

  • Figure 1: Multi-headed architecture of HYDRA: During training, we learn to predict waypoints, low-level actions, and the mode label for each time step. One network (Dense Net) predicts the low-level action $a_t$ and the mode $m_t$; both the action and mode heads of Dense Net share an intermediate representation $e_t$. A separate network (Sparse Net) predicts the high-level waypoint $w_t$. At test time, we sample $m_t$ and either servo to reach a waypoint ($m_t = 0$) without requerying the policy, or follow a dense action for one time step ($m_t = 1$). An example of how sparse and dense modes can be arbitrarily stitched together at test time is shown on the right.
  • Figure 2: Simulation and real-world environments, with task stages shown for real-world tasks. Simulation: In NutAssemblySquare, we pick up a square nut at various positions and orientations and insert it onto a vertical square peg. In ToolHang, a hanging frame is inserted onto a fixed stand, followed by placing a tool on the frame. Both the frame and tool poses are randomized. Frame insertion is challenging due to the small insertion area. KitchenEnv involves turning on a stove, moving a pot onto the stove, putting an object in the pot, then moving the pot to a serving area. Real World: PegInsertion involves inserting a peg with a hole in the center onto a round insertion rod (top right); the peg location and geometry are varied. MakeCoffee is a 6-step task (top middle row) involving picking up a K-pod, inserting it into a Keurig machine, closing the lid of the Keurig, positioning a mug, and then pressing start on the Keurig; the K-pod location and mug orientation are varied. Unlike prior work [zhu2022viola], we include a mug component. MakeToast has 7 steps (bottom middle row): a hinged toaster oven is opened, a spatula is picked up, bread is placed in the toaster, the toaster is closed, and the dial is turned to start. Bread and spatula initial poses vary. SortDishes (bottom row) has 6 stages: pick up spoon, place spoon in rack, grasp plate and insert it into rack, and grasp mug and hang the mug. All object poses vary.
  • Figure 3: Simulation results for HYDRA vs. BC, BC-RNN, and VIOLA: best-checkpoint success rate averaged over three seeds. Left to right: NutAssemblySquare (state), ToolHang (state), and KitchenEnv (vision) tasks. HYDRA beats the baselines on all of these tasks, and even beats VIOLA [zhu2022viola] on the kitchen task despite using a much smaller and simpler model. We also compare BC-RNN and HYDRA with decreasing data sizes for NutAssemblySquare, showing that our method is more sample efficient than BC-RNN. Removing action relabeling from HYDRA (HYDRA-NR, shown for NutAssemblySquare and ToolHang) lowers performance by 7-8%.
  • Figure 4: Real-world results for HYDRA vs. BC, BC-RNN, and VIOLA. The x-axis denotes each stage (the right-most value is the final success rate). Top Left: HYDRA vs. BC-RNN on the real PegInsertion task for 50 demos under 32 rollouts across 4 different nuts. This task requires very precise grasping and insertion of multiple types of nuts, which our method does with high success. While the baseline is unable to perform insertion, HYDRA gets 41% success. Top Right: MakeCoffee long-horizon task for 100 demos under 10 rollouts. Our method beats the baseline by 60%. Bottom Left: MakeToast long-horizon task for 100 demos under 10 rollouts. While both methods struggle to turn the toaster on, HYDRA is able to reach 50% success for 6/7 stages compared to 10% for the baseline. Bottom Right: SortDishes for 100 demos under 10 rollouts. Waypoints in HYDRA precisely capture the diverse poses in this task, beating BC-RNN by 40% and 20% for the last two stages.
  • Figure 5: Mode labeling example for the peg-insertion task. For each demo, a human labels binary click signals at each time step (during or after collection) to segment trajectories into arbitrary sequences of sparse waypoint phases and dense action phases. Left: Uncurated demo, with single clicks and sustained clicks shown. Right: Relabeled demo, with waypoint and dense segments overlaid in green and orange, respectively. We also relabel actions for the states in sparse segments with the optimal waypoint-reaching action, shown in white. For sparse segments, the waypoint head of HYDRA is trained to output the final waypoint at each state along the trajectory.
  • ...and 2 more figures
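
The relabeling idea in the Figure 5 caption can be sketched roughly as follows: every state in a sparse segment is given the segment's final waypoint as its waypoint label, and its action is replaced by an action that heads toward that waypoint. This is only an illustration under stated assumptions, not the authors' implementation. In particular, the function name `relabel_sparse_segments`, the use of the pose at the last step of each sparse segment as "the final waypoint", the position-only state, and the step-size-limited straight-line action standing in for the "optimal waypoint-reaching action" are all hypothetical choices.

```python
import numpy as np


def relabel_sparse_segments(positions, actions, modes, max_step=0.05):
    """Relabel one demo given per-step mode labels (0 = sparse, 1 = dense).

    Returns (waypoints, new_actions):
      waypoints[t]   = final position of the sparse segment containing t
                       (NaN for dense steps, which have no waypoint label)
      new_actions[t] = step toward that waypoint, capped at max_step
                       (dense steps keep their original action)
    """
    positions = np.asarray(positions, dtype=float)
    new_actions = np.array(actions, dtype=float)       # copy, keep dense actions
    waypoints = np.full_like(positions, np.nan)
    T = len(modes)
    t = 0
    while t < T:
        if modes[t] == 1:                               # dense step: leave as-is
            t += 1
            continue
        end = t                                         # find end of sparse segment
        while end + 1 < T and modes[end + 1] == 0:
            end += 1
        w = positions[end]                              # assumed final waypoint
        for k in range(t, end + 1):
            waypoints[k] = w
            delta = w - positions[k]
            dist = np.linalg.norm(delta)
            if dist > max_step:                         # cap the step length
                delta = delta * (max_step / dist)
            new_actions[k] = delta
        t = end + 1
    return waypoints, new_actions


# Toy demo: four sparse steps with noisy human actions, then two dense steps.
positions = np.array([[0.00, 0.00], [0.02, 0.03], [0.05, -0.01],
                      [0.10, 0.00], [0.11, 0.00], [0.12, 0.00]])
actions = np.array([[0.02, 0.03], [0.03, -0.04], [0.05, 0.01],
                    [0.01, 0.00], [0.01, 0.00], [0.01, 0.00]])
modes = [0, 0, 0, 0, 1, 1]

waypoints, new_actions = relabel_sparse_segments(positions, actions, modes)
```

After relabeling, the noisy detours in the sparse segment are replaced by actions that all point at the same waypoint, which is one way to read the caption's claim that relabeling makes the demonstrated actions more consistent. The dense segment is left untouched.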