Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations

Nicola Messina, Rosario Leonardi, Luca Ciampi, Fabio Carrara, Giovanni Maria Farinella, Fabrizio Falchi, Antonino Furnari

Abstract

Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations: natural language descriptions of the actions performed by the camera wearer, which contain clues about manipulated objects. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations in a weakly-supervised regime. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations. Code and data can be found at https://fpv-iplab.github.io/WISH.

Paper Structure

This paper contains 44 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Models designed to solve the NS-iHOS task, like our proposed WISH, learn to segment in-hand objects using human narrations as the sole form of weak supervision; predictions are then made directly from images at test time.
  • Figure 2: The architecture of WISH. Our model operates in two stages sharing a common backbone. (a) An object segmenter and a CLIP-based backbone extract visual embeddings for all object and hand proposals. (b) In Stage 1, we learn a shared embedding space to align hand-specific noun phrases from narrations with their corresponding visual object embeddings. (c) In Stage 2, we generate pseudo-labels from this alignment to train two specialized heads: a Contactness head (C) and a Matching head (M). At test time, only the backbone and Stage 2 are used for narration-free in-hand object segmentation.
  • Figure 3: Qualitative results of WISH.
  • Figure 4: Qualitative results of the pseudo-labels from Stage 2 of WISH on EPIC-Kitchens.
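
The Figure 2 caption says that pseudo-labels are generated from the alignment between hand-specific noun phrases and visual object embeddings, and that these pseudo-labels train a Contactness head and a Matching head. The sketch below is one hypothetical way such pseudo-labels could be derived, not the authors' implementation. It assumes cosine similarity in the shared embedding space and a fixed similarity threshold; the function names, the threshold value, and the toy two-dimensional embeddings are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pseudo_labels(phrase_embs, object_embs, threshold=0.5):
    """Hypothetical pseudo-label generation (an assumption, not the paper's procedure).

    phrase_embs: dict mapping a hand name to the embedding of the noun phrase
                 narrated for that hand, or None if no object is narrated.
    object_embs: list of embeddings, one per object proposal in the frame.
    Returns a dict mapping each hand to the index of its best-matching
    proposal, or None when no proposal reaches the threshold.
    """
    labels = {}
    for hand, phrase in phrase_embs.items():
        if phrase is None or not object_embs:
            labels[hand] = None
            continue
        sims = [cosine(phrase, obj) for obj in object_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        # Keep the match only if it is confident enough (threshold is assumed).
        labels[hand] = best if sims[best] >= threshold else None
    return labels


# Toy example: the left hand's phrase aligns with proposal 0; the right hand is empty.
phrases = {"left": [1.0, 0.0], "right": None}
objects = [[0.9, 0.1], [0.0, 1.0]]
labels = pseudo_labels(phrases, objects)
contact_targets = {hand: idx is not None for hand, idx in labels.items()}
```

Under these assumptions, `labels` would act as targets for a matching head (which proposal each hand holds) and `contact_targets` as targets for a contactness head (whether the hand holds anything at all).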