BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay

TL;DR

BOP-Ask addresses a gap in vision-language models by focusing on fine-grained object-interaction reasoning in cluttered environments. It introduces a large-scale, geometry-grounded dataset built on 6D poses from BOP, with six reasoning skills and pixel-level QA, plus core and lab benchmarks to test generalization. Empirical results show that fine-tuning VLMs on BOP-Ask improves 3D grounding, grasping, and trajectory planning, and transfers to out-of-distribution benchmarks and real robot tasks, though some tasks remain challenging. The work provides a practical path toward embodied spatial understanding and manipulation for VLM-based robotic systems, with extensive ablations and real-robot demonstrations supporting its claims.

Abstract

Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning, for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.
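
To make the idea of deriving annotations from 6D poses concrete, here is a minimal sketch of how a depth-relationship question and a pixel-level answer could be produced from pose translations. It is an illustration only, not the authors' pipeline: the `ObjectPose` class, the function names, the question template, and the OpenCV-style camera convention (+z pointing away from the camera, poses expressed in the camera frame) are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ObjectPose:
    """Hypothetical container: object name and its center in the camera frame (meters)."""
    name: str
    translation_m: Tuple[float, float, float]  # (x, y, z), assuming +z points away from the camera


def project_to_pixel(pose: ObjectPose, fx: float, fy: float, cx: float, cy: float) -> Tuple[int, int]:
    """Project the object center to image coordinates with a pinhole camera model."""
    x, y, z = pose.translation_m
    if z <= 0:
        raise ValueError("object center must lie in front of the camera")
    return (round(fx * x / z + cx), round(fy * y / z + cy))


def depth_question(a: ObjectPose, b: ObjectPose) -> Dict[str, str]:
    """Build a question-answer pair by comparing the z (depth) of two object centers."""
    closer = a if a.translation_m[2] < b.translation_m[2] else b
    return {
        # Illustrative template, not taken from the dataset.
        "question": f"Which object is closer to the camera, the {a.name} or the {b.name}?",
        "answer": closer.name,
    }


if __name__ == "__main__":
    mug = ObjectPose("mug", (0.10, 0.00, 0.50))
    bowl = ObjectPose("bowl", (-0.20, 0.05, 1.00))
    print(depth_question(mug, bowl))
    print(project_to_pixel(mug, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
```

Comparing only object centers is a simplification; a real pipeline would likely also account for object extent and occlusion.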

Paper Structure

This paper contains 25 sections, 1 equation, 6 figures, 9 tables.

Figures (6)

  • Figure 1: The BOP-Ask dataset facilitates object-interaction reasoning for robot manipulation. This illustration demonstrates how a model trained on BOP-Ask enables human- and robot-aligned spatial understanding for different actions, supporting physical-relationship reasoning, locating where to grasp objects, precise pose estimation, and motion planning between objects.
  • Figure 2: Overview of the BOP-Ask dataset. We automatically generate object-interaction and spatial reasoning annotations from 3D point clouds, images, object poses, and 3D models with descriptions. We create question/answer pairs covering six types of questions (from left to right, top to bottom): object pose estimation, grasp affordance, motion planning, physical interaction, object relationship, and depth relationship.
  • Figure 3: Predictions on samples from BOP-Ask-core and BOP-Ask-lab (the latter marked "(lab)"), showing improvements gained from fine-tuning on BOP-Ask. Predictions from NVILA (magenta) and NVILA SFT (blue) are shown alongside the Ground Truth (green). For the 'Rearrangement' task, the Ground Truth shape delineates the area of valid predictions. Absence of a colored prediction indicates that none was made or that it was out of frame. Images are from HOPE [tyree2022hope], HANDAL [handal], and YCB-V [posecnn].
  • Figure 4: Our proposed data generation framework can transform all the 6D-pose-annotated RGB-D images within BOP into a robotics-ready, precise spatio-geometric reasoning benchmark.
  • Figure 5: Distribution of free-form questions and task types in BOP-Ask.
  • ...and 1 more figure