FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation

Chao Tang, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong, Hanbo Zhang, David Hsu, Hong Zhang

TL;DR

This work tackles one-shot imitation learning (OSIL) for tool manipulation, where tools that serve the same function can differ widely in geometry (intra-function variation). FUNCTO introduces a 3D functional keypoint representation (function point, grasp point, center point) and a three-stage pipeline (functional keypoint extraction, function-centric correspondence establishment, and functional keypoint-based action planning) to transfer a skill from a single human demonstration video to novel tools. Real-robot experiments show that FUNCTO outperforms modular OSIL methods and end-to-end behavioral cloning (BC) baselines when generalizing to unseen tools, while maintaining task feasibility. By focusing on functional rather than purely geometric correspondences, and by using vision-language prompts for keypoint detection and refinement, the method offers a data-efficient path to tool use in robotics.

Abstract

Learning tool use from a single human demonstration video offers a highly intuitive and efficient approach to robot teaching. While humans can effortlessly generalize a demonstrated tool manipulation skill to diverse tools that support the same function (e.g., pouring with a mug versus a teapot), current one-shot imitation learning (OSIL) methods struggle to achieve this. A key challenge lies in establishing functional correspondences between demonstration and test tools, considering significant geometric variations among tools with the same function (i.e., intra-function variations). To address this challenge, we propose FUNCTO (Function-Centric OSIL for Tool Manipulation), an OSIL method that establishes function-centric correspondences with a 3D functional keypoint representation, enabling robots to generalize tool manipulation skills from a single human demonstration video to novel tools with the same function despite significant intra-function variations. With this formulation, we factorize FUNCTO into three stages: (1) functional keypoint extraction, (2) function-centric correspondence establishment, and (3) functional keypoint-based action planning. We evaluate FUNCTO against existing modular OSIL methods and end-to-end behavioral cloning methods through real-robot experiments on diverse tool manipulation tasks. The results demonstrate the superiority of FUNCTO when generalizing to novel tools with intra-function geometric variations. More details are available at https://sites.google.com/view/functo.
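
To make the idea of keypoint-based correspondence concrete, the sketch below fits a least-squares rigid transform (the Kabsch algorithm) that maps three test-tool keypoints onto three demonstration-tool keypoints. This is only an illustration under stated assumptions, not the paper's procedure: the abstract does not give FUNCTO's actual constraints, the function name `rigid_align` is hypothetical, and the coordinates are made-up values in meters. Because tools with the same function differ in shape, the two keypoint triangles are generally not congruent, so the fit is approximate.

```python
import numpy as np


def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src[i] + t ~ dst[i].

    Standard Kabsch algorithm; used here as an illustrative stand-in,
    not as FUNCTO's correspondence method.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Hypothetical keypoints, ordered as (function point, grasp point, center point).
demo_keypoints = np.array([[0.10, 0.00, 0.05],
                           [-0.08, 0.00, 0.00],
                           [0.00, 0.01, 0.02]])
test_keypoints = np.array([[0.30, 0.22, 0.06],
                           [0.31, 0.05, 0.00],
                           [0.29, 0.13, 0.03]])

R, t = rigid_align(test_keypoints, demo_keypoints)
aligned = test_keypoints @ R.T + t
# Residual per keypoint shows how far the two tools' keypoint layouts differ.
residuals = np.linalg.norm(aligned - demo_keypoints, axis=1)
```

A purely rigid fit treats all three keypoints equally; a function-centric method would presumably prioritize the function point, which is why this sketch should be read as a baseline for intuition only.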

Paper Structure

This paper contains 17 sections, 17 equations, 20 figures, 2 tables, 1 algorithm.

Figures (20)

  • Figure 1: FUNCTO establishes functional correspondences between demonstration and test tools using 3D functional keypoints. With a single human demonstration video, FUNCTO generalizes the demonstrated tool manipulation skill to novel tools, even with significant intra-function geometric variations.
  • Figure 2: An overview of the FUNCTO framework. The pipeline consists of three stages: (1) Functional keypoint extraction, where functional keypoints and their trajectories are extracted from the human demonstration video; (2) Function-centric correspondence establishment, where function-centric correspondences between demonstration and test tools are established using geometric constraints on the functional keypoints; and (3) Functional keypoint-based action planning, where the test tool trajectory is synthesized and executed to accomplish a functionally equivalent task.
  • Figure 3: A graphical illustration of the function point, grasp point, center point, effect point, target point, and target frame.
  • Figure 4: Qualitative results of function point transfer. (a) shows the function point extracted from the human demonstration. Function points in (b) and (c) are proposed by the VLM in a zero-shot manner. (d) shows the transferred function point using (a) as a reference.
  • Figure 5: An illustration of the function axis alignment process: (1) test function plane $\Pi_R^0$, (2) demonstration function plane $\Pi_H^{t_f}$, (3) initially aligned test function plane $\Pi_R^{t_f}$, and (4) VLM-refined test function plane $\Pi_R^{t_f}$.
  • ...and 15 more figures