Segmenting Object Affordances: Reproducibility and Sensitivity to Scale

Tommaso Apicella, Alessio Xompero, Paolo Gastaldo, Andrea Cavallaro

TL;DR

This paper addresses reproducibility issues in visual affordance segmentation by re-implementing and retraining existing methods under a common framework and evaluation setup, and by introducing Mask2Former-based affordance segmentation (M2F-AFF). It benchmarks two single-object scenarios—unoccluded tabletop objects and hand-held containers—across real and synthetic datasets to assess generalization and scale robustness. The study finds that M2F-AFF often outperforms prior methods on tabletop and many hand-occluded settings, while highlighting persistent sensitivity to object scale and occupancy that challenges cross-dataset transfer. The work provides a fair baseline for future comparisons and proposes extending benchmarks to multi-object and more occlusion-heavy scenarios to further advance reproducibility and robustness in affordance segmentation.

Abstract

Visual affordance segmentation identifies the image regions of an object that an agent can interact with. Existing methods re-use and adapt learning-based architectures for semantic segmentation to the affordance segmentation task and evaluate them on small datasets. However, experimental setups are often not reproducible, leading to unfair and inconsistent comparisons. In this work, we benchmark these methods under a reproducible setup on two single-object scenarios, tabletop objects without occlusions and hand-held containers, to facilitate future comparisons. We include a version of a recent architecture, Mask2Former, re-trained for affordance segmentation, and show that this model performs best on most testing sets of both scenarios. Our analysis shows that models are not robust to scale variations when object resolutions differ from those in the training set.
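To make the two quantities discussed here concrete, the following is a minimal sketch and not the authors' evaluation code. It assumes masks are 2-D integer label maps with background encoded as 0, takes "occupancy" to mean the fraction of image pixels covered by the object, and uses per-class intersection over union as a stand-in for segmentation accuracy. The function names `object_occupancy` and `per_class_iou` are hypothetical.

```python
import numpy as np

BACKGROUND = 0  # assumed label id for background pixels (not stated in the source)


def object_occupancy(mask, background_id=BACKGROUND):
    """Fraction of image pixels carrying any non-background label."""
    mask = np.asarray(mask)
    if mask.ndim != 2:
        raise ValueError("expected a 2-D label mask")
    return np.count_nonzero(mask != background_id) / mask.size


def per_class_iou(pred, target, class_ids):
    """Intersection over union for each class id; NaN when the class is absent from both masks."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = {}
    for c in class_ids:
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ious[c] = float(inter) / float(union) if union > 0 else float("nan")
    return ious


# Toy example: a 2x2 object (class 1) in a 4x4 image.
target = np.zeros((4, 4), dtype=np.int64)
target[1:3, 1:3] = 1

# A prediction that over-segments the object by one column.
pred = np.zeros_like(target)
pred[1:3, 1:4] = 1

# The same 2x2 object placed in a larger 8x8 image has lower occupancy.
small = np.zeros((8, 8), dtype=np.int64)
small[3:5, 3:5] = 1

print(object_occupancy(target))         # 4 of 16 pixels
print(object_occupancy(small))          # 4 of 64 pixels
print(per_class_iou(pred, target, [1]))
```

Comparing occupancy distributions between training and testing sets in this way is one simple check for the scale mismatch that the abstract identifies as a source of performance drops.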

Paper Structure

This paper contains 19 sections, 9 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Comparison of segmentation accuracy between methods on the testing set of UMD [myers2015affordance]. Methods are re-implemented and trained using the same experimental setup. KEYS: G: grasp, CU: cut, SC: scoop, CO: contain, P: pound, SU: support, WG: wrap-grasp, AVG: average. Methods: CNN [nguyen2016detecting], AffordanceNet [do2018affordancenet], DRNAtt [gu2021visual], M2F-AFF.
  • Figure 1: Statistics of object pixel occupancy in the unoccluded and hand-occluded testing sets.
  • Figure 2: Comparison between the ground truth and model predictions on the UMD testing set. Legend: graspable, cut, scoop, contain, pound, support, wrap-grasp.
  • Figure 2: Comparison between the segmentation annotation and the predictions of DRNAtt [gu2021visual] and M2F-AFF on the UMD testing set with varying object scale. Legend: graspable, cut, scoop, contain, pound, support, wrap-grasp.
  • Figure 3: Comparison of the affordance and arm segmentation results between the models on the two mixed-reality testing sets (top row) and on the two real testing sets (bottom row). Note the different y-axis limits. KEYS: G: grasp, CO: contain, A: arm, AVG: average. Methods: RN50F [hussain2020fpha], DRNAtt [gu2021visual], RN18-U [apicella2023affordance], ACANet [apicella2023affordance], ACANet50 [apicella2023affordance], M2F-AFF [cheng2022masked].
  • ...and 4 more figures