KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

Zixian Liu; Mingtong Zhang; Yunzhu Li

TL;DR

KUDA addresses open-vocabulary robotic manipulation by unifying vision-language prompting with data-driven dynamics through a keypoint-based intermediate representation. A VLM generates keypoint target specifications from language and RGBD observations; these are translated into 3D cost functions, which a neural dynamics model uses to optimize robot trajectories via MPPI in a closed loop. The system uses a Top-K prompt library and a CLIP-based retriever to provide few-shot prompts within token limits, enabling robust generalization to diverse objects and materials. KUDA demonstrates state-of-the-art performance on tasks involving ropes, granular materials, and deformable objects, highlighting the practical potential of combining language-grounded goal specification with learned dynamics for flexible manipulation.
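As a rough illustration of the Top-K retrieval step, the sketch below ranks a small library of embeddings by cosine similarity to a query embedding and keeps the best `k`. It is not the authors' retriever: the function name `top_k_prompts` and the 2D vectors are hypothetical stand-ins, and in a real system the embeddings would come from a CLIP encoder rather than being written by hand.

```python
import numpy as np

def top_k_prompts(query_emb, library_embs, k=3):
    """Return indices and scores of the k library entries closest to the query."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    lib = library_embs / np.linalg.norm(library_embs, axis=1, keepdims=True)
    sims = lib @ q
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Hypothetical stand-in embeddings (a real system would use CLIP features).
library = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
idx, scores = top_k_prompts(query, library, k=2)
print(idx, scores)
```

Only the selected entries would then be placed in the few-shot prompt, which is how a fixed token budget can be respected regardless of library size.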

Abstract

With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io.
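The step from keypoint targets to a planned trajectory can be sketched as follows. This is a simplified MPPI-style loop under loud assumptions, not the paper's implementation: `toy_dynamics` is a hypothetical stand-in for the learned neural dynamics model (it simply translates every keypoint by the action), the cost is assumed to be a sum of squared 3D distances between keypoints and their targets, and all hyperparameters are illustrative.

```python
import numpy as np

def keypoint_cost(keypoints, targets):
    """Assumed cost: sum of squared 3D distances to the target positions."""
    return float(np.sum((keypoints - targets) ** 2))

def toy_dynamics(keypoints, action):
    """Hypothetical stand-in for a learned model: shift all keypoints by the action."""
    return keypoints + action

def rollout(keypoints, actions):
    """Apply a sequence of actions and return the final keypoint positions."""
    state = keypoints
    for a in actions:
        state = toy_dynamics(state, a)
    return state

def mppi_plan(keypoints, targets, horizon=5, samples=256, iters=10,
              sigma=0.05, temperature=0.01, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, 3))  # current nominal action sequence
    for _ in range(iters):
        # Sample perturbed action sequences around the nominal one.
        candidates = mean + rng.normal(0.0, sigma, size=(samples, horizon, 3))
        costs = np.array([keypoint_cost(rollout(keypoints, c), targets)
                          for c in candidates])
        # Exponentially weight low-cost samples, then average them.
        weights = np.exp(-(costs - costs.min()) / temperature)
        weights /= weights.sum()
        mean = np.einsum("s,sht->ht", weights, candidates)
    return mean

# Two keypoints that should each move 10 cm along x.
kp = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]])
tg = kp + np.array([0.1, 0.0, 0.0])
plan = mppi_plan(kp, tg)
initial_cost = keypoint_cost(kp, tg)
final_cost = keypoint_cost(rollout(kp, plan), tg)
print(initial_cost, final_cost)
```

In a closed-loop setting, one would typically execute only the first part of the plan, re-observe the keypoints, and replan from the new state.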
