VBM-NET: Visual Base Pose Learning for Mobile Manipulation using Equivariant TransporterNet and GNNs

Lakshadeep Naik, Adam Fischer, Daniel Duberg, Danica Kragic

TL;DR

VBM-NET plans base poses for mobile manipulation from top-down orthographic projections of the scene rather than from precise object and environment state estimates. It uses a two-stage policy: an equivariant TransporterNet first proposes candidate base poses for grasping, and a graph-based policy then selects the optimal pose among them while accounting for navigation cost. Orthographic projections with an equivariant network address sample efficiency, and graph reasoning handles a variable number of candidate poses. The method achieves performance comparable to classical methods in significantly less planning time and transfers successfully from simulation to a real robot. The work points to generalization and sequential planning as directions for future improvement.

Abstract

In mobile manipulation, selecting an optimal mobile base pose is essential for successful object grasping. Previous works have addressed this problem either through classical planning methods or by learning state-based policies. Both assume access to reliable state information, such as precise object poses and environment models. In this work, we study base pose planning directly from top-down orthographic projections of the scene, which provide a global overview of the scene while preserving spatial structure. We propose VBM-NET, a learning-based method for base pose selection using such top-down orthographic projections. We use an equivariant TransporterNet to exploit spatial symmetries and efficiently learn candidate base poses for grasping. Further, we use graph neural networks to represent a varying number of candidate base poses and use reinforcement learning to determine the optimal base pose among them. We show that VBM-NET can produce solutions comparable to those of classical methods in significantly less computation time. Furthermore, we validate sim-to-real transfer by successfully deploying a policy trained in simulation to real-world mobile manipulation.
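
To illustrate why a graph-style encoder suits a varying number of candidate base poses, the following is a minimal NumPy sketch, not the authors' architecture. It scores a set of candidate nodes with one self-attention layer, so the same weights apply to any number of candidates and reordering the candidates only reorders the scores. The class `CandidateScorer`, the function `select_base_pose`, the random (untrained) weights, and the four-dimensional node features are all assumptions made for illustration; in the paper the selection policy is trained with reinforcement learning.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CandidateScorer:
    """Hypothetical single-head self-attention scorer over candidate poses.

    Weights are random stand-ins; a real policy would learn them.
    """

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(in_dim)
        self.w_q = rng.normal(0.0, scale, (in_dim, hid_dim))
        self.w_k = rng.normal(0.0, scale, (in_dim, hid_dim))
        self.w_v = rng.normal(0.0, scale, (in_dim, hid_dim))
        self.w_out = rng.normal(0.0, 1.0 / np.sqrt(hid_dim), (hid_dim,))
        self.hid_dim = hid_dim

    def __call__(self, nodes):
        # nodes: (N, in_dim), one row per candidate base pose; N may vary.
        q, k, v = nodes @ self.w_q, nodes @ self.w_k, nodes @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(self.hid_dim), axis=-1)  # (N, N)
        h = np.tanh(attn @ v)                                     # (N, hid_dim)
        logits = h @ self.w_out                                   # (N,)
        return softmax(logits)  # probability of choosing each candidate


def select_base_pose(scorer, nodes):
    """Return the index of the highest-scoring candidate and all scores."""
    probs = scorer(nodes)
    return int(np.argmax(probs)), probs
```

A call such as `select_base_pose(CandidateScorer(4, 8), nodes)` works for any number of rows in `nodes`. The feature columns could hold, for example, a candidate's position, heading, and an estimated navigation cost, but that layout is an assumption of this sketch rather than something the abstract specifies.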

Paper Structure

This paper contains 20 sections, 13 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: VBM-Net overview. Top row: Orthographic top-down projection of the scene is processed by TransporterNet to predict potential base poses for grasping (yellow arrows). Bottom row: The scene and candidate base poses are encoded as a graph, and a Graph Neural Network is used for selecting the optimal base pose (green arrow).
  • Figure 2: VBM-Net. (a) Stage 1: Learning potential base poses for grasping using the policy $\pi_{\text{irm}}$. (b) Stage 2: Selecting the optimal base pose by incorporating navigation cost, using the policy $\pi_{\text{bp}}$, from the candidate poses generated in Stage 1.
  • Figure 3: Learning potential base poses for grasping the object: (a) Input RGB and depth images of the scene with the cropped robot query. (b) Equivariant networks for extracting feature embeddings with TransporterNet. (c) Generated feature embeddings. (d) Identified feasible base positions with the robot’s original orientation + $135^\circ$ (counterclockwise). (e) Rotated robot feature embeddings. (f) Identified feasible base positions with the robot’s original orientation + $225^\circ$ (counterclockwise).
  • Figure 4: Attention-based graph encoder for candidate base poses in $\mathcal{A}_{\text{valid}}(s)$.
  • Figure 5: Qualitative results for VBM-Net and NBS. Note: The tail of each predicted base-pose arrow corresponds to the robot’s center.
  • ...and 3 more figures
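
The mechanism described in the Figure 3 caption, correlating rotated robot query embeddings against scene embeddings to find feasible base positions, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's implementation: raw arrays stand in for learned feature embeddings, the query is assumed square, and rotations are restricted to quarter turns via `np.rot90` so that no interpolation is needed (the caption shows offsets such as $135^\circ$ and $225^\circ$, which quarter turns cannot represent). The function names `cross_correlate`, `score_base_poses`, and `best_pose` are hypothetical.

```python
import numpy as np


def cross_correlate(scene, query):
    """Valid-mode 2D cross-correlation, summed over the channel axis.

    scene: (H, W, C) feature map; query: (h, w, C) feature kernel.
    """
    H, W, _ = scene.shape
    h, w, _ = query.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(scene[i:i + h, j:j + w] * query)
    return out


def score_base_poses(scene, query):
    """Correlate the scene with the query rotated by 0, 1, 2, 3 quarter turns.

    Returns an array of shape (4, H - h + 1, W - w + 1). The query must be
    square so every rotated copy has the same spatial shape.
    """
    maps = [cross_correlate(scene, np.rot90(query, k, axes=(0, 1)))
            for k in range(4)]
    return np.stack(maps)


def best_pose(scores):
    """Return (row, col, rotation in degrees) of the highest score."""
    k, i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return int(i), int(j), 90 * int(k)
```

Each slice of the returned score volume plays the role of one heatmap of candidate positions for a fixed orientation, as in panels (d) and (f) of Figure 3. Thresholding the volume instead of taking the single maximum would yield a variable-sized set of candidates, which is the kind of input a graph-based selection stage needs.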