Multi-Object Graph Affordance Network: Goal-Oriented Planning through Learned Compound Object Affordances
Tuba Girgin, Emre Ugur
TL;DR
The paper addresses learning affordances for compound objects formed by arbitrary object stacks and proposes the Multi-Object Graph Affordance Network (MOGAN), which represents compounds as graphs of learned object features and predicts three-dimensional effects E1, E2, E3 when a new object is placed on top. The approach combines a depth-image autoencoder, graph neural networks (GCNConv), and a linear decoder to forecast the spatial outcomes of actions, enabling goal-oriented planning via tree search. Key contributions include a novel continuous 3D effect encoding tailored for concave/convex shapes, a graph-based multi-object representation for planning, and extensive validation in both PyBullet simulation and real UR10 experiments, showing superior planning performance over a multi-object DeepSym baseline and robust performance on nonlinear compound scenarios. The work advances robot manipulation by enabling scalable, geometry-aware reasoning over complex object assemblies and supports practical planning for stacking, insertion, and bridging tasks with real-world applicability.
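As a rough illustration of goal-oriented planning via tree search over predicted effects, the sketch below runs a breadth-first search over sequences of stack actions. It is not the authors' planner. The object names, their heights, the scalar "height change" effect, and the functions `toy_predict_effect` and `plan_stack` are all hypothetical stand-ins for a learned effect predictor.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical object heights in centimetres (made-up values for the demo).
OBJECT_HEIGHTS = {"cube": 5.0, "cup": 8.0, "ring": 2.0}


def toy_predict_effect(compound: Tuple[str, ...], new_object: str) -> float:
    """Stand-in for a learned effect predictor (assumption, not the paper's model).

    Returns the predicted change in stack height when `new_object` is placed
    on top of `compound`. The invented rule: a ring placed directly on a cup
    drops inside it and adds no height.
    """
    if compound and compound[-1] == "cup" and new_object == "ring":
        return 0.0
    return OBJECT_HEIGHTS[new_object]


def plan_stack(
    goal_height: float,
    available: Sequence[str],
    predict: Callable[[Tuple[str, ...], str], float],
    max_depth: int = 4,
    tol: float = 1e-6,
) -> Optional[List[str]]:
    """Breadth-first tree search over stack actions.

    Each node holds (objects stacked so far, predicted height, unused objects).
    Returns the shortest placement order whose predicted height matches the
    goal, or None if no sequence within `max_depth` actions reaches it.
    """
    frontier = [((), 0.0, tuple(available))]
    for depth in range(max_depth + 1):
        next_frontier = []
        for compound, height, remaining in frontier:
            if abs(height - goal_height) <= tol:
                return list(compound)
            if depth == max_depth:
                continue  # depth limit reached: do not expand further
            for i, obj in enumerate(remaining):
                child = (
                    compound + (obj,),
                    height + predict(compound, obj),
                    remaining[:i] + remaining[i + 1:],
                )
                next_frontier.append(child)
        frontier = next_frontier
    return None
```

With these toy values, `plan_stack(15.0, ["cube", "cup", "ring"], toy_predict_effect)` places the ring before the cup, because the predictor says a ring on top of a cup would add no height. This is the kind of order-dependent outcome a planner built on effect prediction can exploit.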
Abstract
Learning object affordances is an effective tool in robot learning. While data-driven models have investigated the affordances of single or paired objects, the affordances of compound objects composed of an arbitrary number of objects remain largely unexplored. We propose the Multi-Object Graph Affordance Network, which models complex compound object affordances by learning the outcomes of robot actions that facilitate interactions between an object and a compound. Given depth images of the objects, object features are extracted via convolution operations and encoded in the nodes of graph neural networks. Graph convolution operations encode the state of the compounds, and the resulting encodings are passed to decoders that predict the outcome of the object-compound interactions. After the compound object affordances are learned, the outcome predictors are used, for a given task, to plan sequences of stack actions that involve stacking objects on top of each other, inserting smaller objects into larger containers, and passing ring-like objects through poles. We show that our system successfully models the affordances of compound objects that include concave and convex objects, in both simulated and real-world environments. We benchmark our system against a baseline model to highlight its advantages.
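The pipeline of node features, graph convolution, and a decoder can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The symmetric normalization with self-loops follows the standard graph convolution formulation, while the mean pooling over nodes, the tiny feature sizes, the weight matrices, and the function names (`normalized_adjacency`, `gcn_layer`, `predict_effect`) are hypothetical choices made for the example. In the actual system, the node features would come from the learned depth-image encoder.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(edges: Sequence[Tuple[int, int]], n: int) -> Matrix:
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph with n nodes."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_hat: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """One graph convolution followed by ReLU: max(0, A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, x) for x in row] for row in z]


def predict_effect(
    node_feats: Matrix,
    edges: Sequence[Tuple[int, int]],
    w_gcn: Matrix,
    w_dec: Matrix,
) -> List[float]:
    """Encode a compound graph and decode an effect vector.

    Assumption: node embeddings are mean-pooled into one compound embedding
    before a linear decoder; the paper may combine them differently.
    """
    a_hat = normalized_adjacency(edges, len(node_feats))
    h = gcn_layer(a_hat, node_feats, w_gcn)
    pooled = [sum(col) / len(h) for col in zip(*h)]
    return matmul([pooled], w_dec)[0]
```

For example, a two-object compound with one edge, 2-dimensional node features, and a 2x3 decoder matrix yields a 3-element effect vector, matching the idea of predicting a small fixed-size effect regardless of how many objects the compound contains.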
