Visual-Geometric Collaborative Guidance for Affordance Learning

Hongchen Luo, Wei Zhai, Jiao Wang, Yang Cao, Zheng-Jun Zha

TL;DR

The paper proposes a visual-geometric collaborative guided affordance learning network that jointly uses visual and geometric cues to extract interactive affinity from human-object interactions. Experimental results show that the method outperforms representative models in both objective metrics and visual quality.

Abstract

Perceiving potential "action possibilities" (i.e., affordance) regions of images and learning interactive functionalities of objects from human demonstration is a challenging task due to the diversity of human-object interactions. Prevailing affordance learning algorithms often adopt the label assignment paradigm and presume that there is a unique relationship between functional region and affordance label, yielding poor performance when adapting to unseen environments with large appearance variations. In this paper, we propose to leverage interactive affinity for affordance learning, i.e., extracting interactive affinity from human-object interaction and transferring it to non-interactive objects. Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, can provide inherent cues of interconnectivity between humans and objects, thereby reducing the ambiguity of the perceived action possibilities. To this end, we propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues to excavate interactive affinity from human-object interactions jointly. Besides, a contact-driven affordance learning (CAL) dataset is constructed by collecting and labeling over 55,047 images. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality. Project: https://github.com/lhc1224/VCR-Net

Paper Structure

This paper contains 17 sections, 15 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Interactive affinity. (a) Interactive affinity refers to the contact between different parts of the human body and the local regions of a target object. (b) Interactive affinity provides rich cues that guide the model to acquire invariant features of the object's local regions interacting with the body part, thus counteracting the multiple possibilities caused by diverse interactions.
  • Figure 2: Motivation. (a) We consider both the semantic and structural cues to extract the interactive affinity from the interaction images. (b) We exploit the implicit structural cues of body pose and apparent similarity to jointly perform the interactive affinity transfer.
  • Figure 3: Overview of the proposed VCR-Net. Our approach mainly consists of three parts: feature extraction, a semantic-pose heuristic perception module (Sec. \ref{sec:IAE}), and a geometric-apparent alignment transfer module (Sec. \ref{sec:IAT}).
  • Figure 4: DEQ fuse layer. The DEQ fuse layer consists of a transformation $f_\theta$ that is driven to equilibrium between different input features.
  • Figure 5: Dataset image examples. Some examples of images and annotations from the CAL dataset.
  • ...and 12 more figures