
Visual Prompt Guided Unified Pushing Policy

Hieu Bui, Ziyan Gao, Yuya Hosoda, Joo-Ho Lee

TL;DR

Experimental results demonstrate that the proposed unified pushing policy not only outperforms existing baselines but also effectively serves as a low-level primitive within a VLM-guided planning framework to solve table-cleaning tasks efficiently.

Abstract

As one of the simplest non-prehensile manipulation skills, pushing has been widely studied as an effective means to rearrange objects. Existing approaches, however, typically rely on multi-step push plans composed of pre-defined pushing primitives with limited application scopes, which restrict their efficiency and versatility across different scenarios. In this work, we propose a unified pushing policy that incorporates a lightweight prompting mechanism into a flow matching policy to guide the generation of reactive, multimodal pushing actions. The visual prompt can be specified by a high-level planner, enabling the reuse of the pushing policy across a wide range of planning problems. Experimental results demonstrate that the proposed unified pushing policy not only outperforms existing baselines but also effectively serves as a low-level primitive within a VLM-guided planning framework to solve table-cleaning tasks efficiently.

Paper Structure

This paper contains 23 sections, 7 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Illustration of a table-cleaning task in which all red blocks must be placed in the left staging area and all blue blocks in the right staging area. The numbered annotations indicate one possible sequence of actions that accounts for feasibility and efficiency.
  • Figure 2: Model Architecture. The input consists of the visual prompt and the latest $T_{\text{obs}}$ steps of image data and robot proprioception. The policy is parameterized by a Diffusion Transformer with alternating self-attention and cross-attention DiT blocks to denoise action tokens $\mathbf{A}^0$ into executable trajectories $\mathbf{A}^1$.
  • Figure 3: Illustration of the visual prompt. Left: Input prompts consisting of two points: $\mathbf{u}_1$ (blue) and $\mathbf{u}_2$ (green). Right: Policy-generated action trajectories. Warm and cool colors represent the trajectories of the gripper's right and left fingers, respectively.
  • Figure 4: The experimental setup consists of a leader-follower system. The follower (left) is equipped with a wrist-mounted camera and a parallel jaw gripper. The leader device (middle) is used to teleoperate the follower. In the data collection phase, only the red blocks were used, while the embedded figure at the bottom right shows objects used for evaluation.
  • Figure 5: Overview of the VLM planning framework
  • ...and 1 more figure
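
The Figure 2 caption describes a flow matching policy that turns noisy action tokens $\mathbf{A}^0$ into an executable trajectory $\mathbf{A}^1$. A minimal sketch of that sampling step is given below, assuming a straight-line (linear interpolation) probability path and plain Euler integration, neither of which is confirmed by the text above. The function names `sample_actions` and `make_toy_velocity` are hypothetical, and the toy velocity field stands in for the trained network, which in the paper is also conditioned on the visual prompt, images, and proprioception.

```python
import numpy as np


def sample_actions(velocity_fn, a0, num_steps=10):
    """Integrate da/dt = velocity_fn(a, t) from t=0 to t=1 with Euler steps.

    a0 plays the role of the noisy action tokens A^0; the return value
    plays the role of the generated trajectory A^1.
    """
    a = a0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        a = a + dt * velocity_fn(a, t)
    return a


def make_toy_velocity(target):
    """Hypothetical stand-in for the learned velocity network.

    For a straight-line path a_t = (1 - t) * a0 + t * target, the velocity
    that carries a_t to the target is (target - a_t) / (1 - t).
    A real policy would predict this from observations and the prompt.
    """
    def velocity_fn(a, t):
        return (target - a) / (1.0 - t)
    return velocity_fn


# Assumed shapes for illustration only: 8 waypoints, 2-D planar actions.
horizon, action_dim = 8, 2
target = np.stack(
    [np.linspace(0.0, 1.0, horizon), np.zeros(horizon)], axis=1
)

rng = np.random.default_rng(0)
a0 = rng.standard_normal((horizon, action_dim))  # A^0: Gaussian noise
a1 = sample_actions(make_toy_velocity(target), a0, num_steps=10)  # A^1
```

With this toy field the Euler integration lands on `target` regardless of the initial noise, which makes the sketch easy to check. A learned velocity field would instead produce different trajectories for different noise samples, which is where the multimodality mentioned in the abstract would come from.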