Large Pre-Trained Models for Bimanual Manipulation in 3D
Hanna Yurchyk, Wei-Di Chang, Gregory Dudek, David Meger
TL;DR
This paper addresses robust bimanual robotic manipulation by injecting semantic priors from pre-trained Vision Transformers into a voxel-based 3D representation. The method extracts DINOv2 attention maps from RGB inputs, lifts them into a 3D voxel grid, and appends these semantic cues as an extra feature channel in a VoxAct-B–based policy trained via behavior cloning. The key contributions are a lightweight voxel featurization pipeline that requires neither fine-tuning of the ViT nor modifications to the downstream policy architecture, and substantial performance gains on the RLBench bimanual benchmark (8.2% absolute and 21.9% relative improvement on average). Ablation studies show that using multiple attention heads yields even larger gains (up to 29.0% absolute on tasks such as open jar), highlighting the usefulness of large-scale visual priors for 3D spatial reasoning in manipulation. Overall, the work demonstrates that semantic information from large pre-trained visual models can be integrated into structured 3D representations to improve generalization and efficiency in dual-arm robotic control.
Abstract
We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
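The lifting step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that a patch-level attention map has already been extracted from the ViT, that each pixel has a corresponding 3D point (for example from a calibrated depth camera), and that saliency is averaged within each voxel; the abstract does not specify the aggregation rule. The function names, the nearest-neighbour upsampling, and the workspace bounds are all hypothetical choices made for illustration.

```python
import numpy as np


def upsample_patch_attention(patch_attn, patch_size):
    """Nearest-neighbour upsample of a (Hp, Wp) patch-level attention map
    to a (Hp * patch_size, Wp * patch_size) pixel-level saliency map.
    (Assumption: the paper may use a different interpolation.)"""
    rows = np.repeat(patch_attn, patch_size, axis=0)
    return np.repeat(rows, patch_size, axis=1)


def lift_saliency_to_voxels(points, saliency, bounds_min, bounds_max, grid_size):
    """Scatter per-pixel saliency into a cubic voxel grid.

    points:   (N, 3) 3D location of each pixel in the workspace frame
    saliency: (N,)   saliency score of each pixel
    Returns a (G, G, G) grid holding the mean saliency of the points that
    fall in each voxel, and 0 for empty voxels. Mean aggregation is an
    assumption made for this sketch.
    """
    points = np.asarray(points, dtype=np.float64)
    saliency = np.asarray(saliency, dtype=np.float64)
    bounds_min = np.asarray(bounds_min, dtype=np.float64)
    bounds_max = np.asarray(bounds_max, dtype=np.float64)

    # Discard points that lie outside the workspace bounds.
    inside = np.all((points >= bounds_min) & (points < bounds_max), axis=1)
    pts, sal = points[inside], saliency[inside]

    # Convert metric coordinates to integer voxel indices.
    voxel_size = (bounds_max - bounds_min) / grid_size
    idx = np.floor((pts - bounds_min) / voxel_size).astype(np.int64)
    idx = np.clip(idx, 0, grid_size - 1)

    # Accumulate sums and counts per voxel, then take the mean.
    flat = np.ravel_multi_index(tuple(idx.T), (grid_size,) * 3)
    n_voxels = grid_size ** 3
    sums = np.bincount(flat, weights=sal, minlength=n_voxels)
    counts = np.bincount(flat, minlength=n_voxels)
    mean = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return mean.reshape(grid_size, grid_size, grid_size)


def append_attention_channel(voxel_features, attn_grid):
    """Append the attention grid as one extra channel to existing
    (G, G, G, C) voxel features, giving (G, G, G, C + 1)."""
    return np.concatenate([voxel_features, attn_grid[..., None]], axis=-1)
```

Because the semantic cue enters only as one additional input channel, a sketch like this leaves the downstream policy untouched apart from the width of its first layer, which is consistent with the claim that the policy architecture needs no modification.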
