Affordance segmentation of hand-occluded containers from exocentric images

Tommaso Apicella, Alessio Xompero, Edoardo Ragusa, Riccardo Berta, Andrea Cavallaro, Paolo Gastaldo

TL;DR

This paper tackles visual affordance segmentation for hand-occluded, hand-held containers in exocentric imagery. It introduces ACANet, a multi-branch UNet-like network that separately learns hand and object features and fuses them through a region-aware fusion module to robustly predict affordances despite occlusion. Trained on CHOC-AFF, a mixed-reality dataset built by annotating CHOC with affordances, ACANet demonstrates superior generalisation to real images (HO-3D, CCM) and better performance on graspable and contain classes compared to baselines, albeit with higher computational cost. The work advances practical affordance understanding in human-robot interaction by explicitly modeling hand and object regions and validating across diverse backgrounds and real-world scenarios.

Abstract

Visual affordance segmentation identifies the surfaces of an object an agent can interact with. Common challenges for the identification of affordances are the variety of the geometry and physical properties of these surfaces as well as occlusions. In this paper, we focus on occlusions of an object that is hand-held by a person manipulating it. To address this challenge, we propose an affordance segmentation model that uses auxiliary branches to process the object and hand regions separately. The proposed model learns affordance features under hand-occlusion by weighting the feature map through hand and object segmentation. To train the model, we annotated the visual affordances of an existing dataset with mixed-reality images of hand-held containers in third-person (exocentric) images. Experiments on both real and mixed-reality images show that our model achieves better affordance segmentation and generalisation than existing models.
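
The idea of weighting a feature map by hand and object segmentation can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function name `weight_by_mask`, the nested-list layout (channels x height x width), and the toy values are assumptions made purely for illustration. The abstract does not specify how the weighted features are combined afterwards, so the sketch stops at the weighting step.

```python
from typing import List

Map2D = List[List[float]]    # H x W soft segmentation mask, values in [0, 1]
FeatureMap = List[Map2D]     # C x H x W feature map


def weight_by_mask(features: FeatureMap, mask: Map2D) -> FeatureMap:
    """Scale every channel of `features` pixel-wise by a soft mask.

    Hypothetical helper: pixels where the mask is close to 1 keep their
    feature values, pixels where it is close to 0 are suppressed.
    """
    return [
        [
            [value * weight for value, weight in zip(feat_row, mask_row)]
            for feat_row, mask_row in zip(channel, mask)
        ]
        for channel in features
    ]


# Toy input (assumed values): one channel, 2 x 2 pixels.
features = [[[1.0, 2.0],
             [3.0, 4.0]]]

# Soft masks such as a hand branch and an object branch might predict.
hand_mask = [[1.0, 0.0],
             [0.5, 0.0]]
object_mask = [[0.0, 1.0],
               [0.5, 1.0]]

hand_features = weight_by_mask(features, hand_mask)
object_features = weight_by_mask(features, object_mask)

print(hand_features)    # features emphasised in the hand region
print(object_features)  # features emphasised in the object region
```

In this toy case the bottom-left pixel is shared equally between the two masks, so both weighted maps retain half of its feature value, while pixels assigned fully to one region are zeroed in the other.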

Paper Structure (17 sections, 7 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Predictions of the proposed visual affordance segmentation model on RGB images of hand-occluded containers. These images (background and object instances) are never seen during training and do not belong to any of the datasets. Key: graspable, contain, arm.
  • Figure 2: ACANet, our proposed model for Arm-Container Affordance segmentation of hand-held containers. The fusion block is highlighted in yellow.
  • Figure 3: Samples of cropped RGB images and segmentation maps of arms and object affordances from the annotated mixed-reality dataset, CHOC-AFF. Key: background, graspable, contain, arm.
  • Figure 4: Comparison of the affordance and hand masks predicted by the models on sampled images from the four testing sets. The segmentation masks are overlaid on the RGB images. Key: GT (ground truth), graspable, contain, arm.