Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement

Ashish Malik; Caleb Lowe; Aayam Shrestha; Stefan Lee; Fuxin Li; Alan Fern

Abstract

We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely either on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both struggle to reason over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose the Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a "which-object" mask indicating what to pick and a "which-target-region" mask specifying where to place it. The resulting system processes RGB-D observations and natural-language task specifications to reactively generate multi-step pick-and-place actions for 3D box rearrangement. We conduct experiments across 11 task variants in warehouse-style environments with 1 to 30 boxes and diverse natural-language constraints. RAMP-3D achieves a 79.5% success rate on long-horizon rearrangement tasks and significantly outperforms 2D VLM-based baselines, establishing mask-based reactive policies as a promising alternative to symbolic pipelines for long-horizon planning.
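To make the reactive formulation concrete, the following is a minimal sketch of the control loop the abstract describes: at each step a model predicts a pick mask, a place mask, and a termination probability, and the masks are mapped to instance IDs to form one pick-and-place action. It is an illustration under stated assumptions, not the authors' implementation. The names `mask_to_instance`, `reactive_rollout`, `observe`, `predict`, and `execute` are hypothetical, as are the majority-vote projection, the 0.5 termination threshold, and the representation of a scene as a flat list of per-point instance IDs.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# One step of output: (pick mask, place mask, probability that the task is done).
Prediction = Tuple[Sequence[bool], Sequence[bool], float]


def mask_to_instance(mask: Sequence[bool], instance_ids: Sequence[int]) -> Optional[int]:
    """Project a per-point mask onto the instance ID that covers most masked points.

    Majority voting is an assumption made for this sketch. Returns None for an empty mask.
    """
    votes = Counter(i for selected, i in zip(mask, instance_ids) if selected)
    return votes.most_common(1)[0][0] if votes else None


def reactive_rollout(
    observe: Callable[[], Sequence[int]],
    predict: Callable[[Sequence[int]], Prediction],
    execute: Callable[[int, int], None],
    max_steps: int = 50,
    done_threshold: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Tuple[int, int]]:
    """Re-observe, predict one (box, region) action, execute it, and repeat."""
    actions: List[Tuple[int, int]] = []
    for _ in range(max_steps):
        instance_ids = observe()  # per-point instance IDs of the current scene
        pick_mask, place_mask, p_done = predict(instance_ids)
        if p_done >= done_threshold:
            break  # the model signals that the goal is satisfied
        box = mask_to_instance(pick_mask, instance_ids)
        region = mask_to_instance(place_mask, instance_ids)
        if box is None or region is None:
            break  # an empty mask yields no executable action
        execute(box, region)
        actions.append((box, region))
    return actions


# Toy scene: instances 1 and 2 are boxes, instance 9 is the target region.
points = [1, 1, 1, 2, 2, 9, 9, 9]
pending = [1, 2]  # boxes that still need to be moved


def toy_predict(ids: Sequence[int]) -> Prediction:
    """Stand-in for the learned model: pick the next pending box, place it on region 9."""
    if not pending:
        return [False] * len(ids), [False] * len(ids), 1.0
    return [i == pending[0] for i in ids], [i == 9 for i in ids], 0.0


plan = reactive_rollout(lambda: points, toy_predict, lambda box, region: pending.remove(box))
print(plan)  # prints [(1, 9), (2, 9)]
```

Because the loop re-observes the scene before every prediction, no explicit symbolic state or full plan is stored; the plan emerges from repeated single-step predictions, which is the sense in which the policy is reactive.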

Paper Structure (28 sections, 5 equations, 6 figures, 4 tables)

Introduction
Related works
Problem Formulation
Reactive Action Mask Planner (RAMP-3D)
UniVLG Overview
Extending UniVLG to RAMP-3D
Pair-contrastive features.
Last-action predictor.
Mask-space action representation.
Training Data Generation
Simulation Environment:
Scene Initialization:
Task Variants:
Data Collection Procedure:
Natural Language Data Augmentation:
...and 13 more sections

Figures (6)

Figure 1: Box rearrangement in a warehouse environment using goals specified in natural language.
Figure 2: Overview of RAMP-3D, built on UniVLG. The model takes posed multi-view RGB-D observations and a natural-language goal as inputs, encodes visual features with a transformer backbone and voxel-based 3D fusion, and encodes the goal with a frozen text encoder. A transformer decoder with learnable queries attends jointly to visual and text tokens and is augmented with pair-contrastive pickup–putdown embeddings and a binary “done” head. At each planning step, the model outputs a pair of 3D masks indicating the pickup target and the target region, along with a termination probability. The predicted masks are projected back to instance IDs to yield box–region actions. The planner is run iteratively for long-horizon box rearrangement.
Figure 2: One-step plan validity in snap-to-target mode.
Figure 3: Joint target identification accuracy by number of boxes for RAMP-3D and 2D-pointer, and target identification difficulty for RAMP-3D broken down by target type.
Figure 4: Long-horizon plan success rates for RAMP-3D under snap-to-target and free-form action execution. Bars show the percentage of rollouts without any invalid actions, broken down by number of boxes per scene and task variants. RAMP-3D attains 79.5% success on average in snap-to-target mode and 66.5% in free-form mode.
...and 1 more figures
