What's the Move? Hybrid Imitation Learning via Salient Points
Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh
TL;DR
SPHINX targets the generalization gap in visuomotor imitation learning by grounding actions in salient points and using a hybrid action framework that alternates between a point-cloud-based waypoint policy and a wrist-image-based dense policy. The waypoint policy identifies semantically meaningful salient points in 3D and predicts offsets from them to produce waypoints for long-range motion, while the dense policy handles precise manipulation from close-up wrist images; a learned mode predictor governs transitions between the two policies. Training relies on a flexible data-collection interface for annotating salient points and switching modes in real time, and uses temporal augmentation to improve data efficiency. Empirically, SPHINX achieves 86.7% success across four real-world and two simulated tasks, outperforms the next-best IL baseline by 41.1% on average across 440 real-world trials, and generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds, with a 1.7x speedup over the most competitive baseline.
Abstract
While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations, even for simple tasks. In this work, we introduce SPHINX: Salient Point-based Hybrid ImitatioN and eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end-effector movements. Given 3D point cloud observations, SPHINX learns to infer task-relevant points within a point cloud, or salient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchor points to predict waypoints for long-range movement, such as reaching target poses in free space. Once near a salient point, SPHINX learns to switch to predicting dense end-effector movements given close-up wrist images for precise phases of a task. By exploiting the strengths of different input modalities and action representations for different manipulation phases, SPHINX tackles complex tasks in a sample-efficient, generalizable manner. Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next-best state-of-the-art IL baseline by 41.1% on average across 440 real-world trials. SPHINX additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds, with a 1.7x speedup over the most competitive baseline. Our website (http://sphinx-manip.github.io) provides open-sourced code for data collection, training, and evaluation, along with supplementary videos.
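The alternation between waypoint and dense modes described above can be illustrated with a minimal toy sketch. This is not the released SPHINX code: all function names here are hypothetical, the saliency scores and offset are assumed to be given (in the method they would come from learned networks), and a simple distance threshold stands in for the learned mode predictor.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
WAYPOINT, DENSE = "waypoint", "dense"


def select_salient_point(points: Sequence[Vec3], scores: Sequence[float]) -> Vec3:
    """Return the point with the highest (assumed given) saliency score."""
    best = max(range(len(points)), key=lambda i: scores[i])
    return points[best]


def waypoint_from(salient: Vec3, offset: Vec3) -> Vec3:
    """Waypoint target = salient point + predicted offset."""
    return tuple(s + o for s, o in zip(salient, offset))


def step_toward(pos: Vec3, target: Vec3, max_step: float) -> Vec3:
    """Move at most max_step along the straight line to target."""
    dist = math.dist(pos, target)
    if dist <= max_step:
        return target
    scale = max_step / dist
    return tuple(p + scale * (t - p) for p, t in zip(pos, target))


def choose_mode(pos: Vec3, salient: Vec3, switch_radius: float) -> str:
    """Hypothetical stand-in for the learned mode predictor:
    use dense control once the end effector is near the salient point."""
    return DENSE if math.dist(pos, salient) <= switch_radius else WAYPOINT


def rollout(
    start: Vec3,
    points: Sequence[Vec3],
    scores: Sequence[float],
    offset: Vec3,
    dense_policy: Callable[[Vec3, Vec3], Vec3],
    switch_radius: float = 0.05,
    max_step: float = 0.1,
    num_steps: int = 12,
) -> Tuple[Vec3, List[str]]:
    """Run the toy hybrid loop and return the final position and mode trace."""
    salient = select_salient_point(points, scores)
    waypoint = waypoint_from(salient, offset)
    pos, trace = start, []
    for _ in range(num_steps):
        mode = choose_mode(pos, salient, switch_radius)
        trace.append(mode)
        if mode == WAYPOINT:
            # Sparse, long-range motion toward the waypoint.
            pos = step_toward(pos, waypoint, max_step)
        else:
            # Dense, small end-effector delta (from wrist images in the real method).
            delta = dense_policy(pos, salient)
            pos = tuple(p + d for p, d in zip(pos, delta))
    return pos, trace


def toy_dense_policy(pos: Vec3, salient: Vec3) -> Vec3:
    """Toy dense policy: close 20% of the remaining gap to the salient point."""
    return tuple(0.2 * (s - p) for p, s in zip(pos, salient))
```

Calling `rollout((0.0, 0.0, 0.5), points, scores, (0.0, 0.0, 0.03), toy_dense_policy)` with a small list of points and scores yields a mode trace that begins in waypoint mode and ends in dense mode. In the actual system the switch is predicted by a learned model rather than computed from a fixed radius.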
