What's the Move? Hybrid Imitation Learning via Salient Points

Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh

TL;DR

SPHINX targets the generalization gap in imitation learning for visuomotor manipulation by combining salient-point grounding with a hybrid action framework that alternates between a point-cloud-based waypoint policy and a wrist-image-based dense policy. The waypoint policy identifies semantically meaningful salient points in 3D and predicts each waypoint as an offset from one of them, while the dense policy handles precise manipulation from close-up wrist images; a learned mode predictor governs transitions between the two. Training relies on a flexible data-collection interface for annotating salient points and switching modes in real time, and uses temporal augmentation to improve data efficiency. Empirically, SPHINX achieves 86.7% success across four real-world and two simulated tasks, outperforms the next-best IL baseline by 41.1% on average across 440 real-world trials, and generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds, with a 1.7x speedup over the most competitive baseline.
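The alternation between the two policies can be pictured with a small toy loop. This is only an illustrative sketch, not the released SPHINX code: `ToyArm`, `waypoint_policy`, `dense_policy`, and `run_hybrid` are hypothetical names, the hand-written rules stand in for learned networks, a scalar height gap stands in for the wrist image, and having each policy return the next mode is an assumption about how the mode predictor is wired.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


class ToyArm:
    """Hypothetical stand-in for the robot: it only tracks a gripper position."""

    def __init__(self, pos: Vec3) -> None:
        self.pos = pos

    def move_to(self, target: Vec3) -> None:
        # Waypoint mode: a controller drives the gripper straight to the target.
        self.pos = target

    def nudge(self, delta: Vec3) -> None:
        # Dense mode: apply one small end-effector displacement.
        self.pos = add(self.pos, delta)


def waypoint_policy(pcd: List[Vec3]) -> Tuple[Vec3, Vec3, str]:
    # Toy rule in place of a learned model: the highest point is "salient",
    # and the waypoint hovers 10 cm above it. Then hand over to dense mode.
    salient = max(pcd, key=lambda p: p[2])
    return salient, (0.0, 0.0, 0.10), "dense"


def dense_policy(gap: float) -> Tuple[Vec3, str]:
    # Toy rule in place of a wrist-image policy: descend in 2 cm steps,
    # closing the remaining gap on the final step.
    if gap <= 0.02 + 1e-9:
        return (0.0, 0.0, -gap), "done"
    return (0.0, 0.0, -0.02), "dense"


def run_hybrid(arm: ToyArm, pcd: List[Vec3], max_steps: int = 50) -> List[str]:
    """Run the toy episode and return the mode used at each step."""
    mode, history = "waypoint", []
    target = arm.pos
    for _ in range(max_steps):
        if mode == "waypoint":
            salient, offset, next_mode = waypoint_policy(pcd)
            arm.move_to(add(salient, offset))  # waypoint = salient point + offset
            target = salient
        elif mode == "dense":
            delta, next_mode = dense_policy(arm.pos[2] - target[2])
            arm.nudge(delta)
        else:
            break  # any other mode (here "done") ends the episode
        history.append(mode)
        mode = next_mode
    return history


scene = [(0.4, 0.0, 0.05), (0.4, 0.1, 0.20), (0.5, 0.1, 0.10)]
arm = ToyArm((0.0, 0.0, 0.5))
modes = run_hybrid(arm, scene)
print(modes[0], modes[-1], arm.pos)
```

In this toy run, one long-range waypoint move is followed by several short dense steps, mirroring the low-frequency/high-frequency split described above.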

Abstract

While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce SPHINX: Salient Point-based Hybrid ImitatioN and eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end-effector movements. Given 3D point cloud observations, SPHINX learns to infer task-relevant points within a point cloud, or salient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchor points to predict waypoints for long-range movement, such as reaching target poses in free space. Once near a salient point, SPHINX learns to switch to predicting dense end-effector movements given close-up wrist images for precise phases of a task. By exploiting the strengths of different input modalities and action representations for different manipulation phases, SPHINX tackles complex tasks in a sample-efficient, generalizable manner. Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next-best state-of-the-art IL baseline by 41.1% on average across 440 real-world trials. SPHINX additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds with a 1.7x speedup over the most competitive baseline. Our website (http://sphinx-manip.github.io) provides open-sourced code for data collection, training, and evaluation, along with supplementary videos.
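To make the anchoring idea concrete, the sketch below computes a waypoint as "most salient point plus its offset" and checks that translating the scene translates the waypoint by the same amount. It is a minimal illustration under stated assumptions: `predict_waypoint` is a hypothetical helper, the saliency scores and per-point offsets are hand-picked numbers rather than network outputs, and the scores are assumed to stay attached to the same points when the scene moves.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def predict_waypoint(
    points: Sequence[Vec3],
    saliency_scores: Sequence[float],
    offsets: Sequence[Vec3],
) -> Tuple[int, Vec3]:
    """Pick the highest-scoring point and add that point's offset to it."""
    i = max(range(len(points)), key=lambda k: saliency_scores[k])
    z, phi = points[i], offsets[i]
    return i, (z[0] + phi[0], z[1] + phi[1], z[2] + phi[2])


def translate(points: Sequence[Vec3], shift: Vec3) -> List[Vec3]:
    """Move every point in the cloud by the same shift."""
    return [(p[0] + shift[0], p[1] + shift[1], p[2] + shift[2]) for p in points]


# Hand-picked toy cloud: index 1 plays the role of a mug handle.
cloud = [(0.50, 0.00, 0.02), (0.55, 0.05, 0.10), (0.60, 0.00, 0.02)]
scores = [0.1, 2.3, -0.4]          # assumed saliency scores (higher = more salient)
offsets = [(0.0, 0.0, 0.0), (0.0, -0.03, 0.05), (0.0, 0.0, 0.0)]

idx, waypoint = predict_waypoint(cloud, scores, offsets)

# Move the whole scene; the waypoint follows because it is expressed
# relative to the salient point rather than in absolute coordinates.
shift = (0.30, -0.10, 0.00)
idx_moved, waypoint_moved = predict_waypoint(translate(cloud, shift), scores, offsets)

print(idx, waypoint, waypoint_moved)
```

Because the offset is relative to the selected point, the policy only has to find the salient point again in a rearranged scene; the offset itself can stay the same.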

Paper Structure

This paper contains 20 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Sphinx is a hybrid IL agent that learns to switch among different modes ($m_t$) of execution to tackle complex tasks with visuospatial generalization. In waypoint mode, $\pi^{\mathrm{waypt}}$ takes a point cloud as input and predicts a single waypoint $w_t$ as an offset ($\phi_t$) from a task-relevant salient point $z_t$ (e.g., a mug handle or coffee pod). After a controller reaches the waypoint, the policy uses learned switching to hand control to a dense policy $\pi^{\mathrm{dense}}$, which takes wrist-camera images as input and outputs dense actions ($a_t$) for precise manipulation around a salient point. On the right, the policy interleaves both modes of execution to complete a long-horizon coffee-making task guided by salient points (●) and mode switches (■).
  • Figure 2: Data Collection Interface: The demonstrator visualizes a point cloud $o^{\mathrm{pcd}}_{t'}$ in a web GUI, where they can click a salient point $z_{t'}$ and specify a waypoint action $w_{t'}$ by clicking and dragging to rotate or translate a digital twin of the gripper. After the controller $\mathcal{C}$ reaches the waypoint to grasp the train, the process repeats for a waypoint above the bridge. The demonstrator then switches to providing dense actions $a_t$ with a 3D SpaceMouse to carefully place the train on the bridge and tilt it, causing the train to roll.
  • Figure 3: Sphinx-Waypoint Architecture & Training Objectives: Sphinx takes downsampled point clouds as input, generating per-point tokens $e_i$, and uses a Transformer-style architecture to predict salient points and waypoint actions (position, orientation, gripper state). Specifically, Sphinx predicts the waypoint's positional component as an offset from a salient point. The model outputs a per-point translational offset $\phi_i$, but we only penalize the offset loss on salient points (shaded) during training. Salient point prediction is supervised using cross-entropy loss ($L_{\mathrm{salient}}$) between predicted $\hat{p}_i$ and ground truth $p_i$ salient probabilities.
  • Figure 4: Success Rates Across Tasks: Left: Sphinx outperforms image-only dense baselines (OpenVLA, diffusion policy) as well as a hybrid baseline (HYDRA) across 3 challenging real-world tasks (Figure 5) collected with hybrid mode teleoperation. Train Track requires a degree of precision that baselines lack, while Sphinx's use of salient points and hybrid actions enables precise, long-horizon manipulation. Right: Sphinx performs $1.6\times$ better than the SoTA image- or point-cloud-based diffusion policies across tasks teleoperated in only waypoint mode. Comparisons with the two vanilla waypoint baselines also show that both saliency prediction and the relative waypoint action representation contribute to Sphinx's strong performance.
  • Figure 5: Sphinx Rollouts: We evaluate Sphinx across a suite of challenging real-world tasks subject to wide initial state variations. Sphinx's waypoint mode alone is precise enough to handle tasks like drawer opening, while the full hybrid policy leverages different action modes to tackle complex tasks such as cup stacking, building and playing with a toy train set, and making coffee.
  • ...and 2 more figures
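
The two training signals named in the Figure 3 caption (a cross-entropy loss on per-point salient probabilities, and an offset loss applied only at salient points) can be sketched in plain Python. This is a hedged illustration, not the paper's implementation: the function names are hypothetical, the squared-error form of the offset loss is an assumption, and treating every point with $p_i > 0$ as "salient" for masking is also an assumption.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over the points of one cloud."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def salient_cross_entropy(logits: Sequence[float], p: Sequence[float]) -> float:
    """Cross-entropy between target salient probabilities p_i and predictions."""
    log_p_hat = log_softmax(logits)
    return -sum(pi * lq for pi, lq in zip(p, log_p_hat) if pi > 0)


def masked_offset_loss(
    pred: Sequence[Vec3], target: Sequence[Vec3], p: Sequence[float]
) -> float:
    """Mean squared offset error over salient points only (assumed mask: p_i > 0)."""
    salient = [i for i, pi in enumerate(p) if pi > 0]
    if not salient:
        return 0.0
    total = 0.0
    for i in salient:
        total += sum((a - b) ** 2 for a, b in zip(pred[i], target[i]))
    return total / len(salient)


# Toy example with four points, where only point 1 is salient.
p_true = [0.0, 1.0, 0.0, 0.0]
uniform_logits = [0.0, 0.0, 0.0, 0.0]
gt_offsets = [(0.0, 0.0, 0.0)] * 4
pred_offsets = [(9.0, 9.0, 9.0), (0.0, 0.0, 0.1), (9.0, 9.0, 9.0), (9.0, 9.0, 9.0)]

ce = salient_cross_entropy(uniform_logits, p_true)
off = masked_offset_loss(pred_offsets, gt_offsets, p_true)
print(ce, off)
```

In the toy example, the large offset errors at non-salient points do not affect the offset loss, which is the behavior the caption describes.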