INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

Hanbo Zhang, Yunfan Lu, Cunjun Yu, David Hsu, Xuguang Lan, Nanning Zheng

TL;DR

INVIGORATE tackles robust interactive visual grounding and grasping of a linguistically specified target object in cluttered scenes by integrating four neural networks for perception and language (O-Net, G-Net, Q-Net, R-Net) with an object-centric POMDP. The system maintains a belief over target objects and object-blocking relations, uses learned observation models and a policy-tree search to decide when to ask disambiguating questions or to grasp, and demonstrates superior performance (83% success) and data-efficient interaction on a Fetch robot compared to a purely learned baseline. Key contributions include a principled integration of neural modules with POMDP planning, an object-centric state representation, and a critiqued captioning approach for robust disambiguation questions. This hybrid approach advances robust robot-human collaboration in cluttered environments, enabling reliable language-guided object retrieval under perceptual and linguistic uncertainty.

Abstract

This paper presents INVIGORATE, a robot system that interacts with humans through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE must address several challenges: (i) inferring the target object among other occluding objects from input language expressions and RGB images, (ii) inferring object blocking relationships (OBRs) from the images, and (iii) synthesizing a multi-step plan to ask questions that disambiguate the target object and to grasp it successfully. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. They allow for unrestricted object categories and language expressions, subject to the training datasets. However, errors in visual perception and ambiguity in human language are inevitable and negatively impact the robot's performance. To overcome these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules. Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguation questions in order to achieve a near-optimal sequence of actions that identifies and grasps the target object. INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning. Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to object grasping in clutter with natural language interactions. A demonstration video is available at https://youtu.be/zYakh80SGcU.
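The belief tracking described in the abstract can be illustrated with a small Bayes update over candidate target objects after a yes/no disambiguation question. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `update_belief`, the object labels, and the answer-reliability value `p_correct` are all hypothetical, and the learned observation models of the actual system are replaced here by a single fixed probability.

```python
from typing import Dict


def update_belief(belief: Dict[str, float], asked: str, answer_yes: bool,
                  p_correct: float = 0.9) -> Dict[str, float]:
    """Bayes update of a target belief after asking "Do you mean <asked>?".

    p_correct is a hypothetical probability that the answer is reliable;
    it stands in for a learned observation model.
    """
    unnormalized = {}
    for obj, prior in belief.items():
        # Likelihood of the observed answer if `obj` were the true target:
        # a "yes" supports the asked object, a "no" supports all the others.
        consistent = (obj == asked) == answer_yes
        likelihood = p_correct if consistent else 1.0 - p_correct
        unnormalized[obj] = prior * likelihood
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("answer has zero probability under the current belief")
    return {obj: p / total for obj, p in unnormalized.items()}


# Two candidates match an ambiguous instruction, so the prior is uniform.
belief = {"black remote": 0.5, "white remote": 0.5}
belief = update_belief(belief, asked="black remote", answer_yes=False)
```

After a "no" to the black remote, most of the probability mass moves to the white remote, but not all of it, because the answer is treated as noisy.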

Paper Structure

This paper contains 33 sections, 13 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Interactive visual grounding and grasping in clutter. The robot receives a verbal instruction from the human to retrieve an object. It tries to identify the target object visually, asks questions to disambiguate the target object, if necessary, and eventually grasps the object. (a) Perceptual uncertainties. The object detection module fails to detect the notebook because of visual occlusion. (b) Language ambiguity. The instruction is ambiguous. There are two remote controllers, one black and one white. Since both satisfy the instruction, the robot asks questions to disambiguate.
  • Figure 2: An overview of INVIGORATE. INVIGORATE integrates data-driven deep learning with model-based POMDP planning. It consists of three components: POMDP planning (top), belief tracking (middle), and visual and language processing (bottom).
  • Figure 3: An example of object-centric belief. Objects are represented as nodes, and $b^{g}$ is shown as a histogram beside each object. Arrows between objects represent $b^{r}$; all object relationships are probabilistic, and dashed arrows indicate relationships with lower probability.
  • Figure 4: An example of a grasping macro for the blue box in the scene. Left: the scene image, with the target blue box marked by a red box. Right: the object blocking graph of the left image, with the grasping macro for the blue box marked by a red dashed line.
  • Figure 5: An overview of policy tree search. Circles denote beliefs and squares denote possible actions. The search considers all possible trajectories to find the optimal one (marked as the red path). The robot then executes the first action on that path (marked as the pink square), which has the highest expected cumulative reward.
  • ...and 3 more figures
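
The policy tree search summarized in the Figure 5 caption can be sketched as a depth-limited search over beliefs. This is an illustrative toy under stated assumptions, not the paper's planner: it models only the belief over the target object and ignores object blocking relationships. The reward values, the answer reliability `P_CORRECT`, and the names `answer_update` and `search` are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Belief = Dict[str, float]
Action = Tuple[str, str]  # ("ask" or "grasp", object name)

# Hypothetical model parameters, chosen only for illustration.
ASK_COST = -1.0
GRASP_OK, GRASP_BAD = 10.0, -10.0
P_CORRECT = 0.9  # assumed probability that a yes/no answer is reliable


def answer_update(belief: Belief, asked: str, yes: bool) -> Tuple[float, Belief]:
    """Return (probability of this answer, posterior belief given it)."""
    unnorm = {}
    for obj, p in belief.items():
        like = P_CORRECT if (obj == asked) == yes else 1.0 - P_CORRECT
        unnorm[obj] = p * like
    z = sum(unnorm.values())
    if z == 0.0:
        return 0.0, belief
    return z, {obj: p / z for obj, p in unnorm.items()}


def search(belief: Belief, depth: int) -> Tuple[float, Optional[Action]]:
    """Exhaustively expand the belief tree to `depth` actions.

    Returns the best expected cumulative reward and the first action
    of the best branch. Grasping ends the episode.
    """
    best_value, best_action = float("-inf"), None
    for obj, p in belief.items():
        # Action node: grasp `obj` (terminal).
        value = p * GRASP_OK + (1.0 - p) * GRASP_BAD
        if value > best_value:
            best_value, best_action = value, ("grasp", obj)
        # Action node: ask about `obj`, then branch on both answers.
        if depth > 1:
            value = ASK_COST
            for yes in (True, False):
                p_ans, posterior = answer_update(belief, obj, yes)
                if p_ans > 0.0:
                    value += p_ans * search(posterior, depth - 1)[0]
            if value > best_value:
                best_value, best_action = value, ("ask", obj)
    return best_value, best_action


ambiguous_value, ambiguous_action = search({"A": 0.5, "B": 0.5}, depth=2)
confident_value, confident_action = search({"A": 0.95, "B": 0.05}, depth=2)
```

With these assumed numbers, the search prefers asking a question when the belief is evenly split and grasping directly when the belief is already peaked, which mirrors the ask-or-grasp trade-off the planner is meant to resolve.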