Learning secondary tool affordances of human partners using iCub robot's egocentric data

Bosong Ding, Erhan Oztop, Giacomo Spigler, Murat Kirtay

TL;DR

This work tackles learning secondary tool affordances by having the iCub robot observe humans using tools from an egocentric, multi-camera setup. Framing two tasks (tool recognition and joint tool–action recognition), the authors benchmark ResNet architectures (ResNet-18, ResNet-50, and ResNet-101) across five input schemes and find that a shared-weight, single-camera configuration (1C-1N with ResNet-50) performs best, with high accuracy for both tools and tool–action pairs. The study shows that secondary affordances can be learned from human demonstrations in real-world settings, which supports more nuanced human–robot collaboration in object manipulation. The dataset and benchmark provide a foundation for robots to infer tool uses beyond primary purposes, with potential impact on tasks such as assembly and assistance that rely on understanding human tool use.

Abstract

Objects, in particular tools, provide several action possibilities to the agents that can act on them; these possibilities are generally referred to as affordances. A tool is typically designed for a specific purpose, such as driving a nail in the case of a hammer, which we call the primary affordance. A tool can also be used beyond its primary purpose, in which case we associate this auxiliary use with the term secondary affordance. Previous work on affordance perception and learning has mostly focused on primary affordances. Here, we address the less explored problem of learning the secondary tool affordances of human partners. To do this, we use the iCub robot to observe human partners with three cameras while they perform actions on twenty objects using four different tools. In our experiments, human partners use tools to perform actions that do not correspond to their primary affordances. For example, the iCub robot observes a human partner using a ruler for pushing, pulling, and moving objects instead of measuring their lengths. In this setting, we constructed a dataset by taking images of objects before and after each action was executed. We then model learning secondary affordances by training three neural networks (ResNet-18, ResNet-50, and ResNet-101), each on three tasks, using raw images showing the 'initial' and 'final' positions of objects as input: (1) predicting the tool used to move an object, (2) predicting the tool used, given an additional categorical input that encodes the action performed, and (3) jointly predicting both the tool used and the action performed. Our results indicate that deep learning architectures enable the iCub robot to predict secondary tool affordances, thereby paving the way for human-robot collaborative object manipulation involving complex affordances.
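
The three tasks listed in the abstract differ only in what the network receives besides the image pair and what it must predict. The following is a minimal sketch of that difference, not the authors' code. The `Sample` fields, the `encode` helper, and every tool name except "ruler" are hypothetical placeholders, and the action list is assumed from the ruler example (push, pull, move).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

# Assumption: only "ruler" is named in the abstract; the other three tool
# names are placeholders for the remaining tools in the dataset.
TOOLS = ["ruler", "tool_b", "tool_c", "tool_d"]
# Assumption: the action set is taken from the abstract's ruler example.
ACTIONS = ["push", "pull", "move"]


@dataclass
class Sample:
    """One demonstration: an image pair plus its labels (hypothetical layout)."""
    initial_image: str  # image taken before the action
    final_image: str    # image taken after the action
    tool: str
    action: str


def one_hot(index: int, size: int) -> List[int]:
    """Return a one-hot list of length `size` with a 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


Target = Union[int, Tuple[int, int]]


def encode(sample: Sample, task: int) -> Tuple[Optional[List[int]], Target]:
    """Return (extra_input, target) for task 1, 2, or 3 from the abstract."""
    tool_id = TOOLS.index(sample.tool)
    action_id = ACTIONS.index(sample.action)
    if task == 1:
        # Task 1: images only -> tool label.
        return None, tool_id
    if task == 2:
        # Task 2: images plus a categorical action input -> tool label.
        return one_hot(action_id, len(ACTIONS)), tool_id
    if task == 3:
        # Task 3: images only -> joint (tool, action) label.
        return None, (tool_id, action_id)
    raise ValueError(f"unknown task: {task}")
```

For example, a ruler-pull demonstration yields the target `0` in task 1, the extra input `[0, 1, 0]` with the target `0` in task 2, and the joint target `(0, 1)` in task 3.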

Paper Structure

This paper contains 15 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: (a) The experimental setup, in which an operator performs a pull action on a wooden cube using a ruler as the tool; (b) the objects and (c) the tools used to construct the dataset.
  • Figure 2: (a) Initial pose of the object and (b) final pose of the object after an action is performed (left to right) with a ruler as the tool.
  • Figure 3: Schematic representation of the ResNet-50 architecture for tool-action pair prediction described in Section \ref{toool_task}. The architecture processes the paired initial and final images through two branches of shared-weight ResNet layers. The feature maps from the two branches are merged and fed into a two-headed classifier that simultaneously predicts the tool and the action.
  • Figure 4: Normalized confusion matrices for the ResNet-50-based 1C-1N architecture: (a) tool-only recognition with action reference, (b) tool-only recognition without action reference, (c) tool recognition output head, and (d) action recognition output head. The matrices show the model's high accuracy across scenarios, with the action recognition head achieving near-perfect accuracy on all actions and tool recognition showing robust performance.
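
Figure 4 reports normalized confusion matrices. As a reminder of how such a matrix is usually built, here is a minimal sketch that assumes row normalization, so that each row (true class) sums to 1. The caption does not state which normalization the paper uses, and the labels in the usage note are toy values, not results from the paper.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     n_classes: int) -> List[List[int]]:
    """Count matrix: entry [t][p] is the number of samples of class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def normalize_rows(matrix: List[List[int]]) -> List[List[float]]:
    """Divide each row by its sum; rows with no samples stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Toy labels for three classes (illustrative only).
toy_true = [0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 2, 2, 0]
toy_counts = confusion_matrix(toy_true, toy_pred, n_classes=3)
toy_normalized = normalize_rows(toy_counts)
```

With these toy labels, the diagonal of `toy_normalized` gives the per-class recall: 0.5, 1.0, and 0.75.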