Large Pre-Trained Models for Bimanual Manipulation in 3D

Hanna Yurchyk, Wei-Di Chang, Gregory Dudek, David Meger

TL;DR

This paper addresses the challenge of robust bimanual robotic manipulation by injecting semantic priors from a pre-trained Vision Transformer into a voxel-based 3D representation. The method extracts DINOv2 attention maps from RGB inputs, lifts them into a 3D voxel grid, and appends these semantic cues as an extra feature channel in a VoxAct-B–based policy trained via behavior cloning. The key contributions are a lightweight voxel featurization pipeline that requires neither fine-tuning of the ViT nor modifications to the downstream policy architecture, and performance gains on the RLBench bimanual benchmark averaging 8.2% absolute and 21.9% relative. Ablation studies show that using multiple attention heads yields even larger gains on some tasks (up to 29.0% absolute, for example on open jar), highlighting the usefulness of large-scale visual priors for 3D spatial reasoning in manipulation. Overall, the work demonstrates that semantic information from large pre-trained visual models can be effectively integrated into structured 3D representations to improve generalization and efficiency in dual-arm robotic control.

Abstract

We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
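The lifting step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a pinhole camera with known intrinsics `K` and camera-to-world pose, uses per-pixel depth back-projection rather than any specific triangulation scheme, and assumes the attention map has already been resized to the image resolution. All function and parameter names (`accumulate_view`, `lift_attention`, `bounds_min`, `grid_size`) are hypothetical.

```python
import numpy as np

def accumulate_view(attn, depth, K, cam_to_world, bounds_min, bounds_max, grid_size):
    """Back-project one view's per-pixel attention into flat per-voxel sums and counts."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                      # pixel rows (v) and columns (u)
    z = depth.reshape(-1).astype(float)
    # Pinhole back-projection into the camera frame (assumed camera model).
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    # Map world coordinates to integer voxel indices inside the workspace bounds.
    voxel_size = (bounds_max - bounds_min) / grid_size
    idx = np.floor((pts_world - bounds_min) / voxel_size).astype(int)
    inside = (z > 0) & np.all((idx >= 0) & (idx < grid_size), axis=1)
    flat = np.ravel_multi_index(tuple(idx[inside].T), (grid_size,) * 3)
    sums = np.bincount(flat, weights=attn.reshape(-1)[inside], minlength=grid_size ** 3)
    counts = np.bincount(flat, minlength=grid_size ** 3)
    return sums, counts

def lift_attention(views, bounds_min, bounds_max, grid_size):
    """Average attention per voxel over all views; voxels with no pixels stay at 0."""
    total = np.zeros(grid_size ** 3)
    hits = np.zeros(grid_size ** 3)
    for attn, depth, K, cam_to_world in views:
        s, c = accumulate_view(attn, depth, K, cam_to_world,
                               bounds_min, bounds_max, grid_size)
        total += s
        hits += c
    mean = np.divide(total, hits, out=np.zeros_like(total), where=hits > 0)
    return mean.reshape((grid_size,) * 3)
```

Accumulating sums and counts separately keeps the per-voxel average correct when several cameras observe the same voxel, which matches the averaging described for the shared grid.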

Paper Structure

This paper contains 27 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the voxel featurization pipeline. Attention maps are extracted from RGB camera views using DINOv2 (Oquab et al., 2024) and projected into a shared 3D voxel grid using multi-view triangulation. RGB and attention features are averaged per voxel.
  • Figure 2: Overview of the bimanual RLBench manipulation tasks (Liu et al., 2024) and voxelization examples. Each task includes language instructions specific to the left and right arms, along with variations in the initial position, size, and color of the relevant object. The center column illustrates voxelization from RGB input, while the right column shows the featurized voxel grid augmented with DINOv2 attention. Brighter regions indicate higher attention values (closer to 1).
  • Figure 3: Overview of the voxel featurization and learning pipeline. At each time step t, RGB-D observations are processed by DINOv2 to extract attention maps, which are projected into a shared 3D voxel grid alongside RGB features, geometric coordinates, and occupancy values. The voxelized scene is then combined with proprioceptive inputs from both arms. A language instruction is encoded using CLIP and fused with the scene and proprioceptive representations. The resulting input is flattened and passed to a PerceiverIO policy learner, which predicts the action for time step t+1.
  • Figure 4: Training loss averaged over five seeds for all tasks, trained on 10 demonstrations. The curves are smoothed for better interpretation.
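To make the "extra feature channel" idea from the captions concrete, the following sketch appends a voxelized attention grid to an existing voxel feature tensor and flattens the result into a token sequence. It is a simplified illustration under assumptions: the channel count, the channel-first layout, and the helper names (`append_attention_channel`, `flatten_to_tokens`) are hypothetical, and a real policy may apply patch embedding before flattening.

```python
import numpy as np

def append_attention_channel(voxel_feats, voxel_attn):
    """Concatenate attention as one extra channel.

    voxel_feats: (C, D, H, W) array, e.g. coordinates, RGB, occupancy (assumed layout).
    voxel_attn:  (D, H, W) array of per-voxel attention values.
    Returns an array of shape (C + 1, D, H, W).
    """
    if voxel_feats.shape[1:] != voxel_attn.shape:
        raise ValueError("attention grid must match the voxel grid resolution")
    return np.concatenate([voxel_feats, voxel_attn[None]], axis=0)

def flatten_to_tokens(voxel_feats):
    """Flatten a (C, D, H, W) grid into a (D*H*W, C) sequence, one token per voxel."""
    channels = voxel_feats.shape[0]
    return voxel_feats.reshape(channels, -1).T
```

Because only the channel dimension grows, the downstream policy needs no structural change beyond accepting one more input channel, which is consistent with the lightweight integration the summary describes.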