A Modern Take on Visual Relationship Reasoning for Grasp Planning

Paolo Rabino; Tatiana Tommasi

TL;DR

This work tackles grasp planning in cluttered scenes by advancing visual manipulation relationship reasoning through a new dataset, D3GD, and an end-to-end model, D3G, that jointly detects objects and predicts a spatial dependency graph via a transformer-based architecture. A threshold-free metric, $AP_{rel}$, evaluates triplet-level relationships, enabling robust comparison beyond traditional detection metrics. Empirical results show state-of-the-art performance on D3GD across difficulty levels and competitive transfer to VMRD, with analysis of design choices and ablations supporting the proposed architecture. The work provides a reusable benchmark and codebase to spur future research in robotic manipulation and relational reasoning in dense clutter.

Abstract

Interacting with real-world cluttered scenes poses several challenges to robotic agents that need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically manage simplified scenarios and focus on predicting pairwise object relationships following an initial object detection phase, but often overlook the global context or struggle with handling redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at https://paolotron.github.io/d3g.github.io.
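To illustrate how a predicted adjacency matrix could drive a pick sequence, here is a minimal Python sketch. It follows the edge convention stated in the Figure 2 caption (A → B means A cannot be moved before B, so objects without outward edges are grasped first). The function name `pick_order` and the four-object scene are hypothetical and are not taken from the released code; this is one plausible downstream use of the graph, not the authors' planner.

```python
def pick_order(adj):
    """Return one valid pick sequence for a dependency graph.

    adj[i][j] == 1 means object i cannot be moved before object j
    (the A -> B convention of the Figure 2 caption).
    """
    remaining = set(range(len(adj)))
    order = []
    while remaining:
        # An object is free when none of its blockers is still in the bin.
        free = sorted(
            i for i in remaining
            if not any(adj[i][j] for j in remaining)
        )
        if not free:
            # A cycle means the predicted graph is inconsistent.
            raise ValueError("cyclic dependencies: no object is free to grasp")
        order.extend(free)
        remaining.difference_update(free)
    return order


# Hypothetical scene with 4 objects: 0 -> 1, 0 -> 2, 1 -> 3.
adj = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(pick_order(adj))  # prints [2, 3, 1, 0]
```

Objects 2 and 3 have no outward edges, so they come first; removing 3 frees 1, and removing 1 and 2 frees 0. A cyclic prediction raises an error instead of producing an order, which shows why redundant or wrong edges matter for grasp planning.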

Paper Structure (15 sections, 8 equations, 5 figures, 6 tables)

Introduction
Related Works
Dataset
Method
Detecting Objects as Graph Nodes
Predicting Object Relations as Graph Edges
Loss Function
Metrics
Experiments
Reference Baselines
Training Procedure Details and Experimental Setup
Results on D3GD
Analysis of Design Choices
Results on VMRD
Conclusions

Figures (5)

Figure 1: Manipulation relationship graphs provide essential information to plan grasping in complex scenes such as those encountered in industrial bin picking. Our work introduces a new testbed for this task ➀. Moreover, we design a novel approach ➁ that considers both object detection and edge relationship prediction in an end-to-end process. Finally, we adopt a tailored evaluation metric ➂ to assess the method's performance across different difficulty levels and challenges.
Figure 2: Data samples from the three main difficulty levels of D3GD. The objects in the bin on the left are identified by a number which is also used in the corresponding graph on the right. Here the arrows indicate a relationship dependency: A $\rightarrow$ B entails that A can't be moved without first moving B. In a downstream grasp task, the objects that should be touched first are those that don't have any outward relationships. For example, in the medium level, the green box (2) and brown box (6) can be safely grasped without causing object collapses that would alter the scene configuration.
Figure 3: Illustration of our D3G. The top row describes the transformer architecture composed of encoder and decoder. The former takes as input scene features and provides as output updated representations which are fed to the decoder together with a fixed number of object queries. The query features obtained from the decoder are filtered and matched to the ground truth objects in the scene to detect them during training. The middle row shows how the same object query representations define the edge features via pairwise concatenation and mappings, before being refined by several graph transformer layers. Finally, a binary classifier predicts from the edge features whether two objects in the scene are connected by a spatial dependency relation. The bottom row zooms inside a layer of the graph transformer assuming only one head for visual clarity.
Figure 4: Relationship loss calculation procedure. The edge prediction matrix $e_{pred}$ contains a score for each pair of object queries. Only some of the N=5 queries match the real P=3 objects. The optimal bipartite matching $\hat{\sigma}=[5,3,1,2,4]$ identifies the correspondence between the predictions $\hat{y}=[3,0,2,0,1]$ and the ground truth objects $y=[1,2,3,0,0]$, which are in order Toy Part, Spray, and Soap. Here $T$ has dimensions $5\times3$ and has value 1 in the cells that identify a match: e.g. cell (5,1) will contain a 1 because the object with ID 1 has been matched to the 5-th prediction. The matrix obtained by multiplying $T^\top \cdot e_{pred} \cdot T$ is directly comparable with the GT matrix via a simple cross-entropy loss.
Figure 5: Qualitative example of VMRN vs D3G. Left: D3GD synthetic test set. Here the object localization is almost identical and correct in the two cases, but VMRN shows a wrong relationship between objects 0 and 2. The features obtained from the union area of the two bounding boxes include an intruding object, causing confusion. Middle: D3GD synth-to-real test set. Our model shows good synth-to-real transfer, while VMRN suffers from the domain shift and detects a false positive. Right: VMRD test set. Both models perform well on the real data, but VMRN predicts a wrong relationship.
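The matching step described in the Figure 4 caption can be reproduced numerically. The NumPy sketch below builds $T$ from the caption's $\hat{\sigma}=[5,3,1,2,4]$ with N=5 and P=3, and checks that $T^\top \cdot e_{pred} \cdot T$ simply selects and reorders the rows and columns of $e_{pred}$ that belong to matched queries. The random scores in `e_pred`, the ground-truth matrix `e_gt`, and the choice of binary cross-entropy are assumptions made for illustration; the caption only says "a simple cross-entropy loss".

```python
import numpy as np

N, P = 5, 3                          # object queries, real objects (Figure 4)
sigma = np.array([5, 3, 1, 2, 4])    # 1-indexed matching from the caption

# T[q, k] = 1 when ground-truth object k+1 is matched to query q+1.
T = np.zeros((N, P))
T[sigma[:P] - 1, np.arange(P)] = 1.0

# Placeholder edge scores in [0, 1): an assumption, not real model output.
rng = np.random.default_rng(0)
e_pred = rng.random((N, N))

# Reorder the query-indexed scores into ground-truth object order.
e_aligned = T.T @ e_pred @ T         # shape (P, P)

# Equivalent view: pick rows and columns of the matched queries.
idx = sigma[:P] - 1
assert np.allclose(e_aligned, e_pred[np.ix_(idx, idx)])

# Hypothetical ground-truth adjacency for the three objects.
e_gt = np.array([[0, 1, 0],
                 [0, 0, 0],
                 [1, 0, 0]], dtype=float)

# Binary cross-entropy over all P x P edges (assumed form of the loss).
p = np.clip(e_aligned, 1e-7, 1 - 1e-7)
bce = -np.mean(e_gt * np.log(p) + (1 - e_gt) * np.log(1 - p))
print(T)
print(float(bce))
```

Cell (5,1) of $T$ (index `[4, 0]` in zero-based NumPy indexing) is 1, as in the caption, because the object with ID 1 is matched to the fifth prediction. Unmatched queries contribute nothing to `e_aligned`, so they do not enter the edge loss.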
