Cross-Embodied Affordance Transfer through Learning Affordance Equivalences

Hakan Aktas, Yukie Nagai, Minoru Asada, Matteo Saveriano, Erhan Oztop, Emre Ugur

TL;DR

This work addresses how to learn affordances that couple objects, actions, and effects across agents by formulating a shared affordance space and Affordance Equivalence. It introduces a multi-channel CNMP-based architecture that encodes object depth maps and time-series of actions and effects into latent vectors, blends them into a common representation $L^F$ via convex weights, and decodes to complete affordance components, enabling cross-embodiment transfer and direct imitation. The authors validate their approach through insertability, graspability, and rollability experiments, plus a real-robot imitation test, demonstrating object- and agent-level equivalences and transfer across diverse robots and objects. Results show the proposed method outperforms baselines in reconstruction and transfer tasks and can operate with partial input channels, highlighting practical potential for cross-robot skill transfer. Limitations include the need to retrain when adding new robots and assumptions about time-series consistency, with future work aimed at more diverse morphologies and ambiguity handling.
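The blending step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `blend_latents`, the channel names, and the rule of renormalizing the weights over the observed channels are all assumptions made for illustration. The sketch only shows how per-channel latent vectors could be combined into a single vector (here standing in for $L^F$) using non-negative weights that sum to one, and how a missing channel could be left out.

```python
import numpy as np


def blend_latents(latents, weights):
    """Convex combination of per-channel latent vectors (illustrative only).

    latents: dict mapping a channel name to a 1-D array, or None if that
             channel is not observed (hypothetical convention).
    weights: dict mapping a channel name to a non-negative raw weight.
    """
    # Keep only the channels that were actually observed.
    available = [name for name, z in latents.items() if z is not None]
    if not available:
        raise ValueError("at least one channel must be observed")

    raw = np.array([weights[name] for name in available], dtype=float)
    if np.any(raw < 0) or raw.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    # Assumption: renormalize so the weights of observed channels sum to 1.
    convex = raw / raw.sum()

    stacked = np.stack(
        [np.asarray(latents[name], dtype=float) for name in available]
    )
    # Weighted sum over channels gives one shared latent vector.
    return convex @ stacked


# Usage: object and action latents are given, the effect channel is missing.
latents = {
    "object": np.array([1.0, 0.0]),
    "action": np.array([0.0, 1.0]),
    "effect": None,
}
weights = {"object": 1.0, "action": 3.0, "effect": 2.0}
shared = blend_latents(latents, weights)
```

Because the weights are non-negative and sum to one, the blended vector always lies inside the convex hull of the observed channel latents, which is one plausible reason a decoder could still work from a subset of channels.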

Abstract

Affordances represent the inherent effect and action possibilities that objects offer to the agents within a given context. From a theoretical viewpoint, affordances bridge the gap between effect and action, providing a functional understanding of the connections between the actions of an agent and its environment in terms of the effects it can cause. In this study, we propose a deep neural network model that unifies objects, actions, and effects into a single latent vector in a common latent space that we call the affordance space. Using the affordance space, our system can generate effect trajectories when an action and an object are given, and can generate action trajectories when effect trajectories and objects are given. Rather than learning the behavior of individual objects acted upon by a single agent, our model forms a "shared affordance representation" spanning multiple agents and objects, which we call Affordance Equivalence. Affordance Equivalence facilitates not only action generalization over objects but also cross-embodiment transfer, linking the actions of different robots. In addition to the simulation experiments that demonstrate the proposed model's range of capabilities, we also show that our model can be used for direct imitation in real-world settings.

Paper Structure (11 sections, 19 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Overview of the proposed model (refer to the text for the details of training and generation). The model has a channel for each element of the affordance tuple. The action channel can consist of several channels for the different agents involved. The system is able to predict missing elements of the affordance tuple given information about the other components.
  • Figure 2: The setup of the insertability experiment (left) and exemplary results of the same experiment. As the effect plots in the middle show, the force readings peak earlier when the opening is insertable (middle left) than when it is not insertable (middle right). The plots on the right show the action generation results: solid lines are the generated trajectories and dashed lines are the ground truth values (the dashed lines are hard to see because they overlap with the solid ones).
  • Figure 3: The depth images of non-insertable (top row) and insertable (bottom row) openings used for training in the insertability experiment, and the generated depth images reconstructed using common affordance representations (on the right).
  • Figure 4: Latent space analysis of the affordance representations formed in the insertability experiment. Each shape denotes a different object. As training progresses, half of the objects converge to one point (insertable) and the other half converge to another point (not insertable).
  • Figure 5: The grasp actions used in the graspability experiment.
  • ...and 4 more figures