PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez

TL;DR

A new transformer-based model, trained in a retrieval fashion, takes as input any combination of 3D poses, pictures of people and textual pose descriptions, and outperforms a standard multi-modal alignment retrieval model when composing modalities.

Abstract

Aligning multiple modalities in a latent space, such as images and texts, has been shown to produce powerful semantic visual representations, fueling tasks like image captioning, text-to-image generation, or image grounding. In the context of human-centric vision, although CLIP-like representations encode most standard human poses relatively well (such as standing or sitting), they lack the precision to discern detailed or uncommon ones. While 3D human poses have often been associated with images (e.g. to perform pose estimation or pose-conditioned image generation), or more recently with text (e.g. for text-to-pose generation), they have seldom been paired with both. In this work, we combine 3D poses, pictures of people and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities. When composing modalities, it outperforms a standard multi-modal alignment retrieval model, making it possible to sort out partial information (e.g. image with the lower body occluded). We showcase the potential of such an embroidered pose representation for (1) SMPL regression from an image with an optional text cue; and (2) fine-grained instruction generation, which consists in generating a text that describes how to move from one 3D pose to another (as a fitness coach would). Unlike prior works, our model can take any kind of input (image and/or pose) without retraining.

Paper Structure (16 sections, 3 equations, 12 figures, 3 tables)

This paper contains 16 sections, 3 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Motivation. Comprehending a complex 3D object in a 2D world is not simple. Having access to several of its shadows, obtained by lighting it under different angles, can help better understand it. Similarly, we collect several multi-modal (and naturally partial) observations of the human pose (the "shadows"), and try to create an enriched pose embedding (the "3D" object). This embedding is derived from 3D joint rotations, pictures of humans and pose descriptions, then further used in downstream applications requiring human pose understanding.
  • Figure 2: The PoseEmbroider framework. Each modality is encoded independently by an encoder (left). The PoseEmbroider (right) is a transformer-based model, taking a varying set of modality inputs. It produces a visual-, 3D-, semantic-aware pose representation $\bar{x}$, by embroidering together available inputs. The model is trained using uni-modal contrastive losses between the modality-specific reprojections $\hat{m} \in \{\hat{v}, \hat{p}, \hat{t}\}$ of $\bar{x}$ and the original modality encodings $m \in \{v,p,t\}$. The total objective function accounts for various $\bar{x}_G$, obtained from the set $G$ of input modalities. $x$ and $e_m$ are learnable tokens; '+' denotes addition.
  • Figure 3: Qualitative examples of any-to-any multi-modal retrieval on the validation split of BEDLAM-Script, for diverse input and output modalities.
  • Figure 4: Qualitative examples of edited-retrieval in a multi-modal setting on BEDLAM-Script. Texts specify new traits with respect to the original pose shown in the image. Artificial occlusion is created by overlaying a black rectangle on the image.
  • Figure 5: The pose instruction generation model. We train the model on pairs of poses $(p_A,p_B)$ and use our frozen PoseEmbroider to encode them. These two embeddings are fused with TIRG (Vo et al., 2019), whose output is used to condition an auto-regressive transformer text decoder via cross-attention. At test time, the trained model can be directly applied to poses, images or a mix of both.
  • ...and 7 more figures
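
The training objective described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation: the choice of InfoNCE as the contrastive loss, the temperature value, the enumeration of every non-empty subset $G$, and the `reproject` callable (a hypothetical stand-in for the PoseEmbroider plus its reprojection heads) are all assumptions made for illustration. Only the Python standard library is used.

```python
import math
from itertools import combinations

MODALITIES = ("v", "p", "t")  # image, 3D pose, text encodings (names from Figure 2)


def input_subsets(modalities=MODALITIES):
    """All non-empty subsets G of the modalities.
    Assumption: the caption only says the objective covers 'various' subsets."""
    return [g for r in range(1, len(modalities) + 1)
            for g in combinations(modalities, r)]


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Symmetric InfoNCE over a batch: queries[i] should match keys[i].
    Assumption: the source does not name the exact contrastive loss."""
    q = [_normalize(x) for x in queries]
    k = [_normalize(x) for x in keys]
    logits = [[sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
              for qi in q]

    def mean_row_loss(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            top = max(row)  # subtract the max for a stable log-sum-exp
            log_z = top + math.log(sum(math.exp(s - top) for s in row))
            total += log_z - row[i]  # cross-entropy with target index i
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_row_loss(logits) + mean_row_loss(transposed))


def total_objective(reproject, encodings, temperature=0.07):
    """Sum uni-modal losses between reprojections and original encodings.

    `reproject` is hypothetical: it receives the encodings of the visible
    modalities and returns a dict with one reprojected batch per modality.
    Assumption: every modality contributes a loss term for every subset G.
    """
    total = 0.0
    for subset in input_subsets():
        visible = {m: encodings[m] for m in subset}
        reprojections = reproject(visible)
        for m in MODALITIES:
            total += info_nce(reprojections[m], encodings[m], temperature)
    return total
```

In this sketch, a perfectly aligned batch yields a loss close to zero, while shuffling the keys makes the loss grow, which is the behaviour a retrieval-style objective relies on.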