Learning secondary tool affordances of human partners using iCub robot's egocentric data
Bosong Ding, Erhan Oztop, Giacomo Spigler, Murat Kirtay
TL;DR
This work tackles learning secondary tool affordances by having the iCub robot observe humans using tools from an egocentric, multi-camera setup. Framing the problem as two tasks (tool recognition and joint tool-action recognition), the authors benchmark three ResNet architectures (ResNet-18, ResNet-50, and ResNet-101) across five input schemes and find that a shared-weight, single-camera scheme (1C-1N with ResNet-50) performs best, with high accuracy on both tools and tool-action pairs. The study demonstrates that secondary affordances can be learned from human demonstrations in real-world settings, enabling more nuanced human-robot collaboration in object manipulation. The dataset and benchmarks provide a foundation for robots to infer diverse tool uses beyond primary purposes, with potential practical impact on tasks such as assembly and assistance that rely on understanding human tool use.
Abstract
Objects, in particular tools, offer several action possibilities to the agents that can act on them; these possibilities are generally referred to as affordances. A tool is typically designed for a specific purpose, such as driving a nail in the case of a hammer, which we call its primary affordance. A tool can also be used beyond its primary purpose, and we refer to such an auxiliary use as a secondary affordance. Previous work on affordance perception and learning has mostly focused on primary affordances. Here, we address the less explored problem of learning the secondary tool affordances of human partners. To do this, we use the iCub robot to observe human partners with three cameras while they perform actions on twenty objects using four different tools. In our experiments, human partners use tools to perform actions that do not correspond to the tools' primary affordances. For example, the iCub robot observes a human partner using a ruler to push, pull, and move objects instead of measuring their lengths. In this setting, we constructed a dataset by taking images of objects before and after each action is executed. We then model the learning of secondary affordances by training three neural networks (ResNet-18, ResNet-50, and ResNet-101), each on three tasks, using raw images showing the 'initial' and 'final' positions of objects as input: (1) predicting the tool used to move an object, (2) predicting the tool used, given an additional categorical input that encodes the action performed, and (3) jointly predicting both the tool used and the action performed. Our results indicate that deep learning architectures enable the iCub robot to predict secondary tool affordances, thereby paving the way for human-robot collaborative object manipulation involving complex affordances.
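
To make the three task formulations concrete, the following is a minimal sketch of how the targets for one (tool, action) sample might be encoded. It is an illustration under stated assumptions, not the authors' implementation: only the ruler is named in the abstract, so the other three tool names are placeholders; the action set (push, pull, move) is taken from the ruler example and may not match the actual dataset; and the function names and the flattened joint-class encoding for task (3) are hypothetical choices.

```python
# Assumption: only "ruler" is named in the abstract; the other names are placeholders.
TOOLS = ["ruler", "tool_b", "tool_c", "tool_d"]
# Assumption: the action set is taken from the ruler example in the abstract.
ACTIONS = ["push", "pull", "move"]


def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def joint_label(tool, action):
    """Map a (tool, action) pair to a single class index (one possible task 3 encoding)."""
    return TOOLS.index(tool) * len(ACTIONS) + ACTIONS.index(action)


def decode_joint(label):
    """Invert joint_label: recover the (tool, action) pair from a class index."""
    tool_idx, action_idx = divmod(label, len(ACTIONS))
    return TOOLS[tool_idx], ACTIONS[action_idx]


def make_targets(tool, action):
    """Build the extra (non-image) input and the target for each of the three tasks."""
    tool_idx = TOOLS.index(tool)
    action_idx = ACTIONS.index(action)
    return {
        # Task 1: predict the tool from the images alone.
        "task1": {"extra_input": None, "target": tool_idx},
        # Task 2: predict the tool, with the action supplied as a categorical input.
        "task2": {"extra_input": one_hot(action_idx, len(ACTIONS)), "target": tool_idx},
        # Task 3: predict tool and action jointly, here as one flattened class.
        "task3": {"extra_input": None, "target": joint_label(tool, action)},
    }


if __name__ == "__main__":
    print(make_targets("ruler", "pull"))
```

In this sketch, task (3) is treated as a single classification over all tool-action pairs; a two-headed output (one head per label) would be an equally plausible reading of "joint prediction", and the abstract does not say which was used.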
