Affordance segmentation of hand-occluded containers from exocentric images
Tommaso Apicella, Alessio Xompero, Edoardo Ragusa, Riccardo Berta, Andrea Cavallaro, Paolo Gastaldo
TL;DR
This paper tackles visual affordance segmentation for hand-held containers that are occluded by the hand in exocentric images. It introduces ACANet, a multi-branch UNet-like network that learns hand and object features separately and fuses them through a region-aware fusion module to predict affordances robustly despite occlusion. Trained on CHOC-AFF, a mixed-reality dataset built by annotating CHOC with affordances, ACANet generalises better to real images (HO-3D, CCM) and performs better on the graspable and contain classes than the baselines, albeit at a higher computational cost. The work advances practical affordance understanding for human-robot interaction by explicitly modelling hand and object regions and validating across diverse backgrounds and real-world scenarios.
Abstract
Visual affordance segmentation identifies the surfaces of an object an agent can interact with. Common challenges for the identification of affordances are the variety of the geometry and physical properties of these surfaces, as well as occlusions. In this paper, we focus on occlusions of an object that is hand-held by a person manipulating it. To address this challenge, we propose an affordance segmentation model that uses auxiliary branches to process the object and hand regions separately. The proposed model learns affordance features under hand occlusion by weighting the feature map with the hand and object segmentations. To train the model, we annotated the visual affordances of an existing dataset of mixed-reality, third-person (exocentric) images of hand-held containers. Experiments on both real and mixed-reality images show that our model achieves better affordance segmentation and generalisation than existing models.
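The idea of weighting a feature map through hand and object segmentation can be illustrated with a minimal NumPy sketch. This is not the ACANet implementation: the function name `region_weighted_fusion`, the sigmoid soft masks, the tensor shapes, and the channel-wise concatenation are all assumptions chosen for illustration, and the abstract does not specify how the weighted features are combined.

```python
import numpy as np


def sigmoid(x):
    """Map segmentation logits to soft masks in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def region_weighted_fusion(features, hand_logits, object_logits):
    """Hypothetical region-aware weighting (illustrative only).

    features:      (C, H, W) shared feature map
    hand_logits:   (H, W) output of an assumed auxiliary hand branch
    object_logits: (H, W) output of an assumed auxiliary object branch
    Returns a (2C, H, W) array: hand-weighted then object-weighted features.
    """
    hand_mask = sigmoid(hand_logits)[None, :, :]      # broadcast over channels
    object_mask = sigmoid(object_logits)[None, :, :]
    hand_feats = features * hand_mask                 # emphasise the hand region
    object_feats = features * object_mask             # emphasise the object region
    # Assumption: the two weighted maps are concatenated along the channel axis.
    return np.concatenate([hand_feats, object_feats], axis=0)


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 4))
hand_logits = rng.standard_normal((4, 4))
object_logits = rng.standard_normal((4, 4))

fused = region_weighted_fusion(features, hand_logits, object_logits)
print(fused.shape)  # (16, 4, 4)
```

In this sketch, pixels that a branch confidently assigns to its region keep their features almost unchanged in the corresponding half of the output, while pixels outside the region are suppressed towards zero. A downstream decoder could then learn different affordance cues from the hand-weighted and object-weighted features.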
