Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

TL;DR

The paper tackles context-aware human affordance generation in complex 2D scenes, where no real person is present for supervision. It introduces a mutual cross-modal attention (MCMA) framework that fuses image and segmentation features into a rich scene context $F^{context}$, enabling an automated four-stage pipeline: sample a probable location with a VAE conditioned on global context, select a pose template with a classifier, and independently sample scale $s^*$ and deformation $d^*$ with two dedicated VAEs before applying a target transformation. Key contributions include the MCMA context representation, a fully automated inference pipeline, and extensive ablations showing that semantic context and modular VAEs improve pose plausibility and alignment (PCK, PCKh, IoU) while achieving better qualitative realism and user preference. The approach targets downstream rendering, virtual humans in scenes, and synthetic data generation, with possible applications in AR/VR, graphics, and scene understanding.
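
The final "target transformation" step can be illustrated with a small sketch. The summary above does not spell out the exact form of the transformation, so everything here is an assumption made for illustration: the hypothetical function `apply_target_transformation`, a pose template stored as joints in a unit box centred at the origin, a scale $s^*$ given as a (width, height) pair, and a deformation $d^*$ given as per-joint linear offsets.

```python
def apply_target_transformation(template, location, scale, deformation):
    """Place a normalized pose template into a scene (illustrative sketch).

    template:    list of (x, y) joints in a unit box centred at the origin (assumed)
    location:    (x, y) scene position, standing in for the sampled o*
    scale:       (width, height) in pixels, standing in for the sampled s* (assumed form)
    deformation: list of per-joint (dx, dy) offsets, standing in for d* (assumed form)
    """
    ox, oy = location
    sw, sh = scale
    # Scale each joint, shift it by its own offset, then translate to the location.
    return [
        (ox + sw * x + dx, oy + sh * y + dy)
        for (x, y), (dx, dy) in zip(template, deformation)
    ]


# Toy two-joint "pose": head above the centre, feet below it.
template = [(0.0, -0.5), (0.0, 0.5)]
pose = apply_target_transformation(
    template,
    location=(100.0, 200.0),
    scale=(40.0, 120.0),
    deformation=[(1.0, 0.0), (0.0, -2.0)],
)
print(pose)
```

In this reading, the location, template, scale, and deformation would each come from one of the four subnetworks; only the last, deterministic step is sketched here.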

Abstract

Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations makes the problem challenging and non-trivial. Moreover, the existing datasets and methods for human affordance prediction in 2D scenes are significantly limited in the literature. In this paper, we propose a novel cross-attention mechanism to encode the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled among individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template by conditioning on the local context and template class. Our experiments show significant improvements over the previous baseline of human affordance injection into complex 2D scenes.
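
To make the idea of "mutually attending spatial feature maps from two different modalities" concrete, here is a minimal sketch using plain scaled dot-product attention. It is not the authors' MCMA block: the function names are hypothetical, there are no learned query/key/value projections, both feature maps are assumed to be flattened to the same number of tokens with the same channel width, and fusing the two attended maps by concatenation is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Each query token attends over all tokens of the other modality."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted average of the other modality's tokens.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def mutual_cross_attention(image_tokens, seg_tokens):
    """Attend in both directions, then concatenate per token (assumed fusion)."""
    image_to_seg = attend(image_tokens, seg_tokens)
    seg_to_image = attend(seg_tokens, image_tokens)
    return [a + b for a, b in zip(image_to_seg, seg_to_image)]


# Two spatial positions, two channels per modality (toy values).
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
seg_tokens = [[0.5, 0.5], [1.0, -1.0]]
context = mutual_cross_attention(image_tokens, seg_tokens)
print(len(context), len(context[0]))
```

The point of the sketch is the symmetry: image features query segmentation features and vice versa, so the fused context at each position carries information from both modalities.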

Paper Structure

This paper contains 16 sections, 15 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of the proposed method. Left: Predicted locations for a new person in the scene. Middle: Estimated scale at each predicted location. Right: Final human pose estimated after scaling and deformation at each predicted location.
  • Figure 2: Architecture of the Mutual Cross-Modal Attention (MCMA) block.
  • Figure 3: An illustration of the proposed architecture. The workflow is divided into four subnetworks to estimate the probable location $o^*$, pose template class $y^*$, scaling parameters $s^*$, and linear deformations $d^*$ of a potential target pose. Every subnetwork exclusively uses the proposed Mutual Cross-Modal Attention (MCMA) block to encode global and local scene contexts as shown in Fig. 2.
  • Figure 4: Qualitative comparison of the proposed method with existing human affordance generation techniques by Wang et al. [wang2017binge], Zhang et al. [zhang2022inpaint], and Yao et al. [yao2023scene].
  • Figure 5: Visualization of the learned distribution. (Left) Input scene. (Middle) Distribution of standing poses. (Right) Distribution of sitting poses.
  • ...and 3 more figures