Segmenting Object Affordances: Reproducibility and Sensitivity to Scale
Tommaso Apicella, Alessio Xompero, Paolo Gastaldo, Andrea Cavallaro
TL;DR
This paper addresses reproducibility issues in visual affordance segmentation by re-implementing and retraining existing methods under a common framework and evaluation setup, and by introducing Mask2Former-based affordance segmentation (M2F-AFF). It benchmarks two single-object scenarios, unoccluded tabletop objects and hand-held containers, across real and synthetic datasets to assess generalization and scale robustness. The study finds that M2F-AFF often outperforms prior methods in the tabletop setting and in many hand-occluded settings, while highlighting a persistent sensitivity to object scale and occupancy that hinders cross-dataset transfer. The work provides a fair baseline for future comparisons and proposes extending benchmarks to multi-object and more occlusion-heavy scenarios to further advance reproducibility and robustness in affordance segmentation.
Abstract
Visual affordance segmentation identifies the image regions of an object that an agent can interact with. Existing methods re-use and adapt learning-based semantic segmentation architectures for the affordance segmentation task and evaluate them on small datasets. However, experimental setups are often not reproducible, which leads to unfair and inconsistent comparisons. In this work, we benchmark these methods under a reproducible setup on two single-object scenarios, tabletop without occlusions and hand-held containers, to facilitate future comparisons. We include a version of a recent architecture, Mask2Former, re-trained for affordance segmentation and show that this model is the best-performing on most testing sets of both scenarios. Our analysis shows that models are not robust to scale variations when object resolutions differ from those in the training set.
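
To make the notion of scale concrete, one simple proxy is object occupancy: the fraction of image pixels covered by the object's mask. The sketch below is illustrative only and is not the evaluation code of the benchmark. The function names, the label convention (0 for background, positive integers for affordance classes), and the toy square "object" are all assumptions introduced for this example.

```python
from typing import List

# Assumed convention: 0 = background, >0 = a hypothetical affordance class id.
Mask = List[List[int]]


def occupancy(mask: Mask) -> float:
    """Return the fraction of pixels labelled with any non-background class."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    foreground = sum(1 for row in mask for value in row if value != 0)
    return foreground / total


def place_square(canvas_size: int, side: int, label: int = 1) -> Mask:
    """Build a square canvas containing a centred square 'object' of the given side."""
    if not 0 <= side <= canvas_size:
        raise ValueError("side must be between 0 and canvas_size")
    start = (canvas_size - side) // 2
    end = start + side
    return [
        [label if start <= r < end and start <= c < end else 0
         for c in range(canvas_size)]
        for r in range(canvas_size)
    ]


if __name__ == "__main__":
    # The same toy object at three scales on a fixed 8x8 canvas.
    for side in (2, 4, 8):
        print(side, occupancy(place_square(8, side)))
```

Halving the side of the object divides its occupancy by four, so a model trained on objects that fill most of the frame may see very different pixel statistics when tested on a dataset where the same objects appear small.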
