Embodied Perception for Test-time Grasping Detection Adaptation with Knowledge Infusion

Jin Liu, Jialong Xie, Leibing Xiao, Chaoqun Wang, Fengyu Zhou

TL;DR

The paper tackles the generalization gap in robotic grasp detection for unseen environments by proposing an embodied test-time adaptation framework that leverages active exploration and a knowledge infusion mechanism. It decomposes the system into a Grasping Knowledge Retrieval Module, an Embodied Perception Module with pre-distributed viewpoints, and a Network Optimization Module, which together adapt both the grasp detector and the knowledge base online without annotations. Key contributions include knowledge-pool-guided initialization of viewpoints, a rigorous quality assessment based on embodied parameters that preserves high-quality samples, and a joint optimization objective combining $\mathcal{L}_{act}$ and $\mathcal{L}_{know}$ for continuous online learning. Real-robot experiments show significant improvements in cross-domain and same-domain grasping accuracy over strong baselines, demonstrating the practical value of autonomous, label-free adaptation in dynamic environments.
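The TL;DR describes a knowledge pool that seeds the initial viewpoint and is itself updated online. The paper uses a learned knowledge retrieval network for this step; the sketch below swaps in a plain nearest-neighbour lookup by cosine similarity, only to illustrate the retrieve-then-update cycle. The object descriptor vectors, the `KnowledgePool` class, and the `match_threshold` parameter are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class KnowledgeEntry:
    """Hypothetical record: an object descriptor and the best viewpoint found for it."""
    descriptor: list[float]                # assumed feature embedding of the object
    viewpoint: tuple[float, float, float]  # assumed camera position that gave the best grasp
    score: float                           # assumed quality of the grasp from that viewpoint


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


@dataclass
class KnowledgePool:
    """Nearest-neighbour stand-in for the paper's learned retrieval network."""
    entries: list[KnowledgeEntry] = field(default_factory=list)

    def retrieve(self, descriptor, default_viewpoint):
        """Return the viewpoint stored for the most similar object, or a default if empty."""
        if not self.entries:
            return default_viewpoint
        best = max(self.entries, key=lambda e: cosine_similarity(descriptor, e.descriptor))
        return best.viewpoint

    def update(self, descriptor, viewpoint, score, match_threshold=0.95):
        """Online update: improve a near-duplicate entry if the new score is higher, else append."""
        for entry in self.entries:
            if cosine_similarity(descriptor, entry.descriptor) >= match_threshold:
                if score > entry.score:
                    entry.viewpoint, entry.score = viewpoint, score
                return
        self.entries.append(KnowledgeEntry(list(descriptor), viewpoint, score))
```

The retrieved viewpoint only seeds exploration; once the robot has assessed its samples, `update` writes back whichever viewpoint scored best, so the pool keeps improving as the robot encounters more objects.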

Abstract

It has long been expected that a robot can be easily deployed to unknown scenarios and accomplish robotic grasping tasks without human intervention. Nevertheless, existing grasp detection approaches are typically off-body techniques, realized by training deep neural networks on extensive annotated data. In this paper, we propose an embodied test-time adaptation framework for grasp detection that exploits the robot's exploratory capabilities. The framework aims to improve the generalization of robotic grasping skills in unforeseen environments. Specifically, we introduce embodied assessment criteria based on the robot's manipulation capability to evaluate the quality of grasp detections and retain suitable samples. This process enables the robot to actively explore the environment and continuously learn grasping skills without human intervention. In addition, to improve the efficiency of exploration, we construct a flexible knowledge base that provides context for selecting initial optimal viewpoints. Conditioned on the retained samples, the grasp detection networks are adapted to the test-time scene. When the robot encounters new objects, it repeats the same adaptation procedure, thereby realizing continuous learning. Extensive experiments on a real-world robot demonstrate the effectiveness and generalization of the proposed framework.
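The abstract does not spell out the embodied assessment criteria. As a minimal sketch, assuming the criteria check whether the robot can physically execute a predicted grasp (gripper opening, approach angle, reach) and that retained samples are then ranked by detector confidence, the sample-retention step could look like this. Every threshold, field name, and the `retain_samples` helper are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class GraspSample:
    """Hypothetical grasp candidate predicted by the detector from one viewpoint."""
    width_m: float       # required gripper opening, in metres
    approach_deg: float  # angle between approach direction and vertical, in degrees
    reach_m: float       # distance from robot base to grasp point, in metres
    confidence: float    # detector score in [0, 1]


# Hypothetical robot limits; real values depend on the gripper and arm in use.
MAX_GRIPPER_WIDTH_M = 0.08
MAX_APPROACH_DEG = 45.0
MAX_REACH_M = 0.85
MIN_CONFIDENCE = 0.5


def is_executable(s: GraspSample) -> bool:
    """Assumed embodied check: can this robot physically execute the predicted grasp?"""
    return (
        0.0 < s.width_m <= MAX_GRIPPER_WIDTH_M
        and abs(s.approach_deg) <= MAX_APPROACH_DEG
        and s.reach_m <= MAX_REACH_M
        and s.confidence >= MIN_CONFIDENCE
    )


def retain_samples(samples: list[GraspSample], k: int) -> list[GraspSample]:
    """Keep at most k executable samples, highest detector confidence first."""
    feasible = [s for s in samples if is_executable(s)]
    return sorted(feasible, key=lambda s: s.confidence, reverse=True)[:k]
```

A hard feasibility filter followed by confidence ranking is cheap enough to run after every viewpoint; a learned or success-based score could replace `confidence` without changing this structure.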

Paper Structure

This paper contains 25 sections, 15 equations, 10 figures, 3 tables, and 1 algorithm.

Figures (10)

  • Figure 1: An example of embodied test-time grasp detection, in which the robot can access only unlabelled data from unseen scenes and a single pre-trained grasp detection network. The green rectangle indicates a viable grasping posture, whereas the red rectangle indicates an unsuccessful one.
  • Figure 2: An overview of the proposed embodied test-time adaptation framework for grasp detection. The robot first retrieves the historical grasping knowledge related to the optimal candidate viewpoint. It then actively explores different viewpoints and preserves optimal samples based on embodied assessment indicators. Finally, conditioned on the collected samples, the knowledge retrieval network and the grasp detection network are optimized, and the networks optimized at test time are deployed in the current scene to facilitate scene adaptation.
  • Figure 3: Examples of active exploration. The left panel illustrates the observation positions, whereas the right panel shows the viewpoints.
  • Figure 4: An example of the process of obtaining the convex hull and object centroid.
  • Figure 5: (a) Overview of the robotic grasping platform. (b) Objects utilized in test-time adaptation.
  • ...and 5 more figures
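Figures 3 and 4 show observation positions distributed around an object and the computation of the object's convex hull and centroid. The summary does not give the exact procedure, so the sketch below uses standard stand-ins: Andrew's monotone-chain convex hull on object points assumed to be projected onto the table plane, the shoelace formula for the hull's area centroid, and `n` viewpoints spaced evenly on a ring at a fixed elevation, each aimed at the centroid. The function names, the projection step, and the `radius` and `elevation_deg` parameters are illustrative assumptions, not the paper's method.

```python
import math

Point2 = tuple[float, float]
Point3 = tuple[float, float, float]


def _cross(o: Point2, a: Point2, b: Point2) -> float:
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: list[Point2]) -> list[Point2]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: list[Point2] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: list[Point2] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_centroid(hull: list[Point2]) -> Point2:
    """Area-weighted centroid via the shoelace formula; vertex mean if the hull is degenerate."""
    n = len(hull)
    if n == 0:
        raise ValueError("empty hull")
    area2 = cx = cy = 0.0  # area2 is twice the signed polygon area
    for i in range(n):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % n]
        c = x0 * y1 - x1 * y0
        area2 += c
        cx += (x0 + x1) * c
        cy += (y0 + y1) * c
    if abs(area2) < 1e-12:  # a point or a segment has no area
        return (sum(p[0] for p in hull) / n, sum(p[1] for p in hull) / n)
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def distribute_viewpoints(center: Point3, radius: float, elevation_deg: float,
                          n: int) -> list[tuple[Point3, Point3]]:
    """Place n cameras evenly on a ring around `center`, each with a unit view direction toward it."""
    el = math.radians(elevation_deg)
    views = []
    for k in range(n):
        az = 2.0 * math.pi * k / n
        pos = (
            center[0] + radius * math.cos(el) * math.cos(az),
            center[1] + radius * math.cos(el) * math.sin(az),
            center[2] + radius * math.sin(el),
        )
        # Every pos lies exactly `radius` from center, so dividing by radius gives a unit vector.
        direction = tuple((c - p) / radius for c, p in zip(center, pos))
        views.append((pos, direction))
    return views
```

For example, the points `[(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]` give the hull `[(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]` with centroid `(1.0, 1.0)`, around which `distribute_viewpoints((1.0, 1.0, 0.0), 0.5, 60.0, 8)` places eight cameras. Because the hull depends only on extreme points, its area centroid is not pulled toward whichever side of the object the depth camera happens to sample more densely.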