A Modern Take on Visual Relationship Reasoning for Grasp Planning
Paolo Rabino, Tatiana Tommasi
TL;DR
This work tackles grasp planning in cluttered scenes by advancing visual manipulation relationship reasoning through a new dataset, D3GD, and an end-to-end model, D3G, that jointly detects objects and predicts a spatial dependency graph via a transformer-based architecture. A threshold-free metric, $AP_{rel}$, evaluates triplet-level relationships, enabling robust comparison beyond traditional detection metrics. Empirical results show state-of-the-art performance on D3GD across difficulty levels and competitive transfer to VMRD, with analysis of design choices and ablations supporting the proposed architecture. The work provides a reusable benchmark and codebase to spur future research in robotic manipulation and relational reasoning in dense clutter.
Abstract
Interacting with real-world cluttered scenes poses several challenges to robotic agents, which need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically manage simplified scenarios and focus on predicting pairwise object relationships following an initial object detection phase, but often overlook the global context or struggle to handle redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at https://paolotron.github.io/d3g.github.io.
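
To illustrate how an adjacency matrix of spatial dependencies can translate into a pick sequence, the following minimal sketch applies a standard topological sort (Kahn's algorithm) to a toy scene. It is not the D3G model or its released code: the function name `pick_order`, the toy matrix `scene`, and the convention that `adjacency[i][j] == 1` means object `i` must be removed before object `j` are all assumptions made for illustration.

```python
from collections import deque


def pick_order(adjacency):
    """Return one valid pick order for a dependency graph.

    Assumed convention (hypothetical, not taken from the paper):
    adjacency[i][j] == 1 means object i must be removed before
    object j can be grasped.
    """
    n = len(adjacency)
    # Count how many objects still block each object.
    blockers = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    # Objects with no blockers can be picked immediately.
    ready = deque(j for j in range(n) if blockers[j] == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        # Removing object i frees every object it was blocking.
        for j in range(n):
            if adjacency[i][j]:
                blockers[j] -= 1
                if blockers[j] == 0:
                    ready.append(j)
    if len(order) != n:
        # Some objects were never freed, so the relations are cyclic.
        raise ValueError("dependency graph contains a cycle")
    return order


# Toy scene: object 0 rests on object 1, which rests on object 2.
scene = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(pick_order(scene))  # [0, 1, 2]
```

A predicted graph with redundant or missing relations may yield a cycle or an unsafe order, which is one reason relationship-level evaluation matters alongside detection accuracy.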
