Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation
Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein
TL;DR
The paper tackles context-aware human affordance generation in complex 2D scenes, where no real person is present for supervision. It introduces a mutual cross-modal attention (MCMA) framework that fuses image and segmentation features into a rich scene context $F^{context}$, enabling an automated four-stage pipeline: sample a probable location with a VAE conditioned on the global context, select a pose template with a classifier, independently sample scale $s^*$ and deformation $d^*$ with two dedicated VAEs, and finally apply the resulting target transformation to the template. Key contributions include the MCMA context representation, a fully automated inference pipeline, and extensive ablations showing that semantic context and modular VAEs improve pose plausibility and alignment (PCK, PCKh, IoU) while also yielding better qualitative realism and user preference. The approach shows potential for downstream rendering, virtual humans in scenes, and synthetic data generation, with applications in AR/VR, graphics, and scene understanding.
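The PCK and PCKh metrics mentioned above can be illustrated with a minimal sketch. It assumes the common definition: a predicted keypoint counts as correct when its distance to the ground-truth keypoint is at most a fraction `alpha` of a reference length (a body or box size for PCK, the head segment length for PCKh). The function name, the `alpha` value, and the toy keypoints are illustrative assumptions, not the paper's evaluation settings.

```python
import numpy as np

def pck(pred, gt, ref_size, alpha=0.2):
    """Fraction of keypoints within alpha * ref_size of the ground truth.

    pred, gt : arrays of shape (K, 2) holding (x, y) keypoints.
    ref_size : reference length (assumed: head segment length for PCKh).
    alpha    : tolerance fraction (assumed value, not taken from the paper).
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint Euclidean error
    return float(np.mean(dists <= alpha * ref_size))

# Toy example with four keypoints; the errors are 2, 0, 15, and 5 pixels.
pred = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0], [40.0, 40.0]])
gt = np.array([[10.0, 12.0], [20.0, 20.0], [30.0, 45.0], [43.0, 44.0]])

score = pck(pred, gt, ref_size=50.0, alpha=0.2)  # threshold is 10 pixels
print(score)  # 0.75
```

With a threshold of 10 pixels, three of the four keypoints fall within tolerance, so the score is 0.75.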
Abstract
Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations makes the problem challenging and non-trivial. Moreover, existing datasets and methods for human affordance prediction in 2D scenes remain significantly limited. In this paper, we propose a novel cross-attention mechanism that encodes the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled into individual subtasks to reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template by conditioning on the local context and the template class. Our experiments show significant improvements over the previous baseline for human affordance injection into complex 2D scenes.
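The idea of mutually attending spatial feature maps from two modalities can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' architecture: the projection matrices are random rather than learned, attention is single-head scaled dot-product, and the two attended outputs are fused by channel concatenation. All function and variable names (`cross_attend`, `mutual_cross_modal_attention`, `f_img`, `f_seg`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feats, context_feats, w_q, w_k, w_v):
    """Queries from one modality attend to keys/values from the other."""
    q = query_feats @ w_q                      # (N, D)
    k = context_feats @ w_k                    # (N, D)
    v = context_feats @ w_v                    # (N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, N) scaled dot products
    return softmax(scores, axis=-1) @ v        # (N, D)

def mutual_cross_modal_attention(f_img, f_seg, params):
    """Fuse two (C, H, W) feature maps by attending in both directions."""
    c, h, w = f_img.shape
    x_img = f_img.reshape(c, h * w).T          # (N, C), N = H * W positions
    x_seg = f_seg.reshape(c, h * w).T
    img_to_seg = cross_attend(x_img, x_seg, *params["img"])
    seg_to_img = cross_attend(x_seg, x_img, *params["seg"])
    # Assumed fusion: concatenate both attended outputs along channels.
    fused = np.concatenate([img_to_seg, seg_to_img], axis=-1)  # (N, 2D)
    return fused.T.reshape(-1, h, w)           # (2D, H, W)

rng = np.random.default_rng(0)
c, h, w, d = 8, 4, 4, 16
# Random stand-ins for learned query/key/value projections.
params = {
    "img": tuple(rng.standard_normal((c, d)) for _ in range(3)),
    "seg": tuple(rng.standard_normal((c, d)) for _ in range(3)),
}
f_img = rng.standard_normal((c, h, w))  # stand-in for image features
f_seg = rng.standard_normal((c, h, w))  # stand-in for segmentation features

f_context = mutual_cross_modal_attention(f_img, f_seg, params)
print(f_context.shape)  # (32, 4, 4)
```

In this sketch each modality supplies the queries once and the keys and values once, so the fused map carries information attended in both directions. In a trained model the projections would be learned, and the fusion step may differ from plain concatenation.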
