Zero-shot Interactive Perception

Venkatesh Sripada, Frank Guerin, Amir Ghalamzan

TL;DR

The paper introduces Zero-shot Interactive Perception (ZS-IP), a framework that unifies vision-language reasoning with physical robot manipulation to resolve occlusions and ambiguous queries in partially observable scenes. It combines Enhanced Observation (EO) with pushlines, grasp keypoints, and a memory-guided action module to iteratively interrogate the environment using a 7-DOF Franka Panda arm. Through a perception-action loop driven by a vision-language model, the system can push, pull, or grasp objects to reveal hidden information and answer semantic questions, outperforming passive and MOKA baselines on diverse tasks. Limitations include depth resolution and the computational demands of large multimodal models, with future work aiming to extend to richer 3D manipulation, remove fixed anchors, and incorporate tactile sensing for more robust real-world deployment.

Abstract

Interactive perception (IP) enables robots to extract hidden information in their workspace and execute manipulation plans by physically interacting with objects and altering the state of the environment -- crucial for resolving occlusions and ambiguity in complex, partially observable scenarios. We present Zero-Shot IP (ZS-IP), a novel framework that couples multi-strategy manipulation (pushing and grasping) with a memory-driven Vision Language Model (VLM) to guide robotic interactions and resolve semantic queries. ZS-IP integrates three key components: (1) an Enhanced Observation (EO) module that augments the VLM's visual perception with both conventional keypoints and our proposed pushlines -- a novel 2D visual augmentation tailored to pushing actions, (2) a memory-guided action module that reinforces semantic reasoning through context lookup, and (3) a robotic controller that executes pushing, pulling, or grasping based on VLM output. Unlike grid-based augmentations optimized for pick-and-place, pushlines capture affordances for contact-rich actions, substantially improving pushing performance. We evaluate ZS-IP on a 7-DOF Franka Panda arm across diverse scenes with varying occlusions and task complexities. Our experiments demonstrate that ZS-IP outperforms passive and viewpoint-based perception techniques such as Mark-Based Visual Prompting (MOKA), particularly in pushing tasks, while preserving the integrity of non-target elements.

Paper Structure

This paper contains 17 sections, 1 equation, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Zero-shot Interactive Perception framework. The Perception Analyser, $z_t = \mathcal{F}_{VLM}(o_t, q)$, evaluates the scene observation $o_t$ at robot configuration $\textbf{x}_t$ and attempts to resolve the query $q$. If it cannot, EO segments the target objects with Grounded SAM and annotates the image, $\tilde{o}_t = EO(o_t)$, with, e.g., pushlines ($EO_{\text{P}}$), keypoints ($EO_{\text{G}}$), or a 2D grid ($EO_{\text{AP}}$). An action $a_t = \pi(\mathcal{M}_t, \textbf{x}_t, z_t, \tilde{o}_t)$ is then generated to perform a physical interaction that helps answer the query. The memory $\mathcal{M}_t$ keeps a history of interactions to avoid redundant actions and improve task efficiency; each of its states $S_t$ contains both the image $o_t$ and a summary scene description $s_t$ relevant to the query. The robot performs the interaction and saves the resulting image and robot state in $S_{t+1}$, which is then passed to the Perception Analyser.
  • Figure 3: Robot workspace during pushing actions. A light source casts a strong shadow. The top row shows Task I (uncovering a box of paper clips under the whiteboard cleaner), the middle row shows Task II (pushing an aluminium tin to reveal a small screw), and the bottom row shows Task III (a clutter of Lego blocks).
  • Figure 4: Task VII performed successfully, shown at several time steps (query: 'What language is the text written in on the black eraser?'): (a) $t=1$: $\{x_1, o_1\}$ with a pushing action proposed, $EO_{\text{P}}$; (b) $t=2$: $\{x_2, o_2\}$ after pushing, with the red and yellow objects separated from the eraser and grasping keypoints, $EO_{\text{G}}$, now proposed; (c) $t=2$: Cam2 view of the eraser after the robot has lifted it and moved it in front of Cam2; (d) Task VIII: a brown book partially occludes the view into the cup.
  • Figure 5: (From left to right) First and third rows: Tasks I to VIII. Second and fourth rows: the robot's workspace in Tasks I to VIII. Task I: the robot needs to push the eraser to see the box of paper clips underneath. Task II: the robot needs to pick up the aluminium tin and place it elsewhere to see the screw in the shadow. Task III: push the Lego blocks to see the pen drive underneath. Task IV: pick up the blue block and place it elsewhere to see the text written on the paper under it. Task V: identify the red block and green Lego structure on the table, then arrange them inside the cardboard box while maintaining their original spatial positions relative to one another. Task VI: grasp the black eraser and bring it in front of the secondary camera to see what is written on its underside. Task VII: similar to Task VI, but the eraser sits between two objects, so ZS-IP must first push the eraser and then grasp it to complete the task. Task VIII: ZS-IP should see what is inside the cup covered by the brown book.
  • Figure 6: Perception Analyser
  • ...and 7 more figures
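
The perception-action loop described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the VLM call, the EO annotator, the policy, and the robot controller are passed in as plain callables, and the names `State`, `Analysis`, `interactive_perception`, `demo`, and the `max_steps` cutoff are all assumptions introduced here. The toy scene in `demo` loosely mimics Task I (an object hidden under a cover) with hand-written stand-ins for every component.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class State:
    """One memory entry S_t: image o_t, robot configuration x_t, summary s_t."""
    observation: Any
    robot_config: Any
    summary: str


@dataclass
class Analysis:
    """Perception Analyser output z_t (hypothetical structure)."""
    resolved: bool
    answer: Optional[str]
    summary: str


def interactive_perception(
    query: str,
    observe: Callable[[], Tuple[Any, Any]],        # returns (o_t, x_t)
    analyse: Callable[[Any, str], Analysis],       # z_t = F_VLM(o_t, q)
    enhance: Callable[[Any], Any],                 # o~_t = EO(o_t)
    policy: Callable[[List[State], Any, Analysis, Any], Any],  # a_t = pi(M_t, x_t, z_t, o~_t)
    execute: Callable[[Any], None],                # robot controller
    max_steps: int = 5,                            # assumed cutoff, not from the paper
) -> Tuple[Optional[str], List[State]]:
    memory: List[State] = []                       # M_t: history of interactions
    for _ in range(max_steps):
        o_t, x_t = observe()
        z_t = analyse(o_t, query)
        memory.append(State(o_t, x_t, z_t.summary))
        if z_t.resolved:                           # query answered, stop interacting
            return z_t.answer, memory
        annotated = enhance(o_t)                   # add pushlines / keypoints / grid
        action = policy(memory, x_t, z_t, annotated)
        execute(action)                            # changes the scene for the next step
    return None, memory                            # step budget exhausted


def demo() -> Tuple[Optional[str], List[State]]:
    """Toy scene: a box of paper clips hidden under a cover until it is pushed."""
    scene = {"covered": True}

    def observe():
        return dict(scene), "home"                 # snapshot of the scene, fake x_t

    def analyse(o, q):
        if o["covered"]:
            return Analysis(False, None, "target occluded by cover")
        return Analysis(True, "paper clips", "target visible")

    def enhance(o):
        return {"image": o, "pushlines": ["p1", "p2"]}

    def policy(memory, x, z, annotated):
        return ("push", annotated["pushlines"][0])

    def execute(action):
        scene["covered"] = False                   # the push uncovers the target

    return interactive_perception(
        "What is under the cover?", observe, analyse, enhance, policy, execute
    )


if __name__ == "__main__":
    answer, memory = demo()
    print(answer, [s.summary for s in memory])
```

Passing the components in as callables keeps the loop independent of any particular VLM or robot API; in a real system the policy would also consult the memory to avoid repeating an action that has already failed, as the caption describes.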