KITE: Keypoint-Conditioned Policies for Semantic Manipulation

Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, Jeannette Bohg

TL;DR

KITE introduces a two-stage framework that grounds language into 2D keypoints and uses keypoint-conditioned 6-DoF skills to perform semantic manipulation. By bridging scene and object semantics with a compact library of parameterized policies, KITE achieves fine-grained manipulation with strong generalization across unseen objects and tasks while requiring relatively modest demonstration data. The approach yields competitive real-world performance in long-horizon tabletop tasks, semantic grasping, and coffee-making, outperforming end-to-end visuomotor baselines and VLM-guided variants. This work highlights the benefits of an interpretable, object-centric intermediate representation for efficient, precise instruction-following in real-world robotics.

Abstract

While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Provided an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves a 75%, 70%, and 71% overall success rate for instruction-following, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: http://tinyurl.com/kite-site.

Paper Structure (21 sections, 2 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Real-World Semantic Manipulation Environments: We visualize our semantic manipulation framework KITE on three real-world environments: long-horizon instruction following, semantic grasping, and coffee-making. Using keypoint-based grounding, KITE contextualizes scene-level semantics ('Pick up the green/red/blue/brown coffee pod') as well as object-level semantics ('Pick up the unicorn by the leg/ear/tail', 'Open the cabinet by the top/middle/bottom shelf') and precisely executes keypoint-conditioned skills.
  • Figure 2: KITE System Overview: KITE receives an image observation $I_t$ along with user instruction $i_t$ and grounds these inputs to a 2D semantic keypoint in the image. After inferring which skill type $l_t$ is appropriate from a set of skill labels, KITE takes an RGB-D point cloud observation $\mathcal{P}_t$, annotated with the deprojected keypoint $\mathcal{M}_t$, and infers the appropriate waypoint policy $\pi$ for execution. After executing this action, KITE replans based on a new observation $(I_{t+1}, i_{t+1})$ and repeats the whole process.
  • Figure 3: Semantic Grasping Experimental Setup: We evaluate KITE on semantic grasping across rigid tools, deformable objects, and articulated items. We show 17 of the 20 objects tested along with ground-truth semantic labels for different features. The top row includes objects seen during grounding module training, and the bottom consists of unseen object instances.
  • Figure 4: KITE Grounding Predictions: KITE's grounding model is able to accurately predict keypoints for both scene semantic instructions (e.g., "grab the lemon" and "put the green pod in") and object semantic instructions (e.g., "shut the top drawer" and "take a peek at the 2nd shelf").
  • Figure 5: PerAct Predictions: We visualize PerAct predictions on the task of opening a cabinet with multiple drawers. Although PerAct exhibits some reasonable predictions (last column), it struggles with localizing the correct handle (1st, 3rd columns). Even when localizing the correct handle (2nd column), the slight imprecision of the predicted vs. ground-truth action can lead to downstream manipulation failure.
  • ...and 1 more figure
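
The loop described in the Figure 2 caption (ground the instruction to a 2D keypoint, pick a skill label, deproject the keypoint into 3D, run the keypoint-conditioned skill) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes a standard pinhole camera model, which the text above does not specify. The names `Intrinsics`, `deproject`, `kite_step`, `ground`, `classify_skill`, and `skills` are all hypothetical placeholders, and the skill here receives only the deprojected 3D point rather than the full annotated point cloud $\mathcal{P}_t$.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (an assumption, not from the source)."""
    fx: float
    fy: float
    cx: float
    cy: float


def deproject(u: int, v: int, depth: Sequence[Sequence[float]], k: Intrinsics) -> Point3D:
    """Lift pixel (u, v) to a 3D camera-frame point using its depth value."""
    z = depth[v][u]  # rows are indexed by v (image y), columns by u (image x)
    if z <= 0:
        raise ValueError("no valid depth at keypoint")
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    return ((u - k.cx) * z / k.fx, (v - k.cy) * z / k.fy, z)


def kite_step(
    image: object,
    depth: Sequence[Sequence[float]],
    instruction: str,
    k: Intrinsics,
    ground: Callable[[object, str], Tuple[int, int]],
    classify_skill: Callable[[str], str],
    skills: Dict[str, Callable[[Point3D], List[Point3D]]],
) -> Tuple[str, List[Point3D]]:
    """One grounding-then-acting step; every callable is a hypothetical stand-in."""
    u, v = ground(image, instruction)      # stand-in for the learned grounding model
    label = classify_skill(instruction)    # stand-in for skill-label inference
    point = deproject(u, v, depth, k)      # 2D keypoint -> 3D point
    waypoints = skills[label](point)       # stand-in for a keypoint-conditioned skill
    return label, waypoints
```

In use, a caller would invoke `kite_step` once per instruction and then re-observe the scene before the next call, mirroring the replanning step in the caption. Keeping the grounding model, skill selector, and skills as separate callables reflects the modular split the abstract describes, though the real system's interfaces may differ.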