Multi-Object Graph Affordance Network: Goal-Oriented Planning through Learned Compound Object Affordances

Tuba Girgin, Emre Ugur

TL;DR

The paper addresses learning affordances for compound objects formed by arbitrary object stacks and proposes the Multi-Object Graph Affordance Network (MOGAN), which represents compounds as graphs of learned object features and predicts three-dimensional effects E1, E2, E3 when a new object is placed on top. The approach combines a depth-image autoencoder, graph neural networks (GCNConv), and a linear decoder to forecast the spatial outcomes of actions, enabling goal-oriented planning via tree search. Key contributions include a novel continuous 3D effect encoding tailored for concave/convex shapes, a graph-based multi-object representation for planning, and extensive validation in both PyBullet simulation and real UR10 experiments, showing superior planning performance over a multi-object DeepSym baseline and robust performance on nonlinear compound scenarios. The work advances robot manipulation by enabling scalable, geometry-aware reasoning over complex object assemblies and supports practical planning for stacking, insertion, and bridging tasks with real-world applicability.

Abstract

Learning object affordances is an effective tool in the field of robot learning. While data-driven models investigate affordances of single or paired objects, there is a gap in the exploration of affordances of compound objects composed of an arbitrary number of objects. We propose the Multi-Object Graph Affordance Network, which models complex compound object affordances by learning the outcomes of robot actions that facilitate interactions between an object and a compound. Given the depth images of the objects, the object features are extracted via convolution operations and encoded in the nodes of graph neural networks. Graph convolution operations are used to encode the state of the compounds, which is then used as input to decoders that predict the outcome of the object-compound interactions. After learning the compound object affordances, given different tasks, the learned outcome predictors are used to plan sequences of stack actions that involve stacking objects on top of each other, inserting smaller objects into larger containers, and passing ring-like objects through poles. We showed that our system successfully modeled the affordances of compound objects that include concave and convex objects, in both simulated and real-world environments. We benchmarked our system against a baseline model to highlight its advantages.
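The pipeline in the abstract (per-object features stored in graph nodes, graph convolutions that encode the compound, a decoder that predicts the effect) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The chain-shaped graph, the tiny feature vectors, the fixed weights, and the function names `gcn_layer` and `predict_effect` are all assumptions made for illustration. In practice the features would come from the depth-image encoder and the weights would be learned.

```python
import math


def gcn_layer(features, edges, weight):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(features)
    # Undirected adjacency matrix with self-loops.
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    deg = [sum(row) for row in adj]
    in_dim, out_dim = len(features[0]), len(weight[0])
    out = []
    for i in range(n):
        # Symmetrically normalised sum over the node and its neighbours.
        agg = [0.0] * in_dim
        for j in range(n):
            if adj[i][j]:
                coef = 1.0 / math.sqrt(deg[i] * deg[j])
                for k in range(in_dim):
                    agg[k] += coef * features[j][k]
        # Linear map followed by ReLU.
        out.append([
            max(0.0, sum(agg[k] * weight[k][c] for k in range(in_dim)))
            for c in range(out_dim)
        ])
    return out


def predict_effect(node_states, query_index, new_object, decoder_weight):
    """Assumed linear decoder: [queried node state, new-object features] -> 3 effect values."""
    x = node_states[query_index] + new_object
    return [sum(x[k] * decoder_weight[k][c] for k in range(len(x))) for c in range(3)]


# Hypothetical compound of three stacked objects, linked in stacking order.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-ins for encoder outputs
edges = [(0, 1), (1, 2)]                          # assumed chain structure
identity = [[1.0, 0.0], [0.0, 1.0]]               # stand-in for learned GCN weights
states = gcn_layer(features, edges, identity)

decoder_weight = [[0.1, 0.2, 0.3]] * 4            # stand-in for learned decoder weights
effect = predict_effect(states, query_index=2, new_object=[0.5, 0.5],
                        decoder_weight=decoder_weight)
print(states[0], effect)
```

The symmetric normalisation used here is the standard graph-convolution rule. How the paper connects the nodes of a compound and what exactly the decoder receives are not specified in this section, so both are placeholders.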

Paper Structure (18 sections, 9 equations, 12 figures, 5 tables, 1 algorithm)


Figures (12)

  • Figure 1: Execution of the plan generated using our MOGAN model to build the shortest compound object given a pole and two rings in the real world setup. The agent uses the pole as the base and stacks the rings, as they do not change the height of the compound.
  • Figure 2: MOGAN: Multi-Object Graph Affordance Network Architecture, along with the pretrained autoencoder. The depth images of single objects are encoded with the autoencoder. It then constructs the graph representation of the compound object. The proposed model, MOGAN, extracts meaningful features from the graph and predicts the resulting effect between a single object and a queried object within the compound object. The predicted effect is visually depicted in the rightmost image with a dashed green circle.
  • Figure 3: Visualization of the calculation of lateral spatial displacements: Imaginary rays are projected through the center of the new object. Red points illustrate the intersections with both the compounding object and newly added object. The black arrows are calculated by the function $s$.
  • Figure 4: A PyBullet environment featuring a UR10 robot and various objects, including cubes, poles, balls, cups, and rings.
  • Figure 5: Various objects used in the real-world setup: a pole, rings, cups, a cube, and balls.
  • ...and 7 more figures
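
The planning task in Figure 1 (build the shortest compound from a pole and two rings) can be sketched as an exhaustive tree search over placement orders. This is a toy illustration under stated assumptions, not the paper's planner. The function `toy_height_effect` is a hand-written stand-in for a learned effect predictor, and the heights and the rule that a ring adds no height when stacked on a pole-based compound are hypothetical values chosen to mirror the figure caption.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical object heights in centimetres (illustrative values only).
HEIGHTS = {"pole": 10.0, "ring": 2.0}


def toy_height_effect(compound: Sequence[str], new_obj: str) -> float:
    """Stand-in for a learned predictor: height added by placing new_obj on compound."""
    if not compound:
        return HEIGHTS[new_obj]
    # Toy rule: a ring slides over the pole if the pole is the base
    # and only rings have been added so far, so the height is unchanged.
    if (new_obj == "ring" and compound[0] == "pole"
            and all(o == "ring" for o in compound[1:])):
        return 0.0
    return HEIGHTS[new_obj]


def plan_shortest(
    objects: Sequence[str],
    predict: Callable[[Sequence[str], str], float],
) -> Tuple[List[str], float]:
    """Depth-first search over placement orders, keeping the lowest predicted height."""
    best_height = float("inf")
    best_plan: List[str] = []

    def expand(remaining: List[str], compound: List[str], height: float) -> None:
        nonlocal best_height, best_plan
        if not remaining:
            if height < best_height:
                best_height, best_plan = height, list(compound)
            return
        tried = set()
        for i, obj in enumerate(remaining):
            if obj in tried:      # identical objects give identical subtrees
                continue
            tried.add(obj)
            expand(remaining[:i] + remaining[i + 1:],
                   compound + [obj],
                   height + predict(compound, obj))

    expand(list(objects), [], 0.0)
    return best_plan, best_height


plan, height = plan_shortest(["ring", "pole", "ring"], toy_height_effect)
print(plan, height)  # ['pole', 'ring', 'ring'] 10.0
```

Under these toy rules the search selects the pole as the base and then stacks both rings, which matches the plan described in the Figure 1 caption. Replacing `toy_height_effect` with a trained predictor would leave the search loop unchanged.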