Unsupervised View-Invariant Human Posture Representation

Faegheh Sardari, Björn Ommer, Majid Mirmehdi

TL;DR

The paper tackles the challenge of learning view-invariant 3D human pose representations without relying on annotated 3D skeleton data or camera parameters. It introduces a dual-encoder auto-encoder with a view-invariant pose encoder $E_{\odot}$ and a viewpoint encoder $E_{\sphericalangle}$, plus a decoder $D$, optimized via a composite loss $\mathcal{L}_{total} = \alpha \mathcal{L}_{invar} + \beta \mathcal{L}_{equiv} + \gamma (\mathcal{L}_{rec1} + \mathcal{L}_{rec2})$ with $\alpha=1.0$, $\beta=0.001$, $\gamma=1.0$. Training enforces canonical pose extraction using a view-invariant loss $\mathcal{L}_{invar}$ computed from simultaneous frames across views and an equivariance loss $\mathcal{L}_{equiv}$ from augmented frames with pixel shifts, along with reconstruction losses, yielding canonical pose features ${P}^{I}_{\odot} \in \mathbb{R}^{3\times N}$ where $N=70$. The approach achieves strong unsupervised cross-view action recognition on NTU RGB+D (RGB $\approx 74.8\%$, depth $\approx 67.5\%$ for CV; RGB $\approx 64.3\%$, depth $\approx 64.7\%$ for CS) and unsupervised cross-view/cross-subject movement analysis on QMAR (SRC $\approx 0.54$ CV, $\approx 0.58$ CS), with further gains when fine-tuning supervised baselines. Ablation confirms the necessity of both $\mathcal{L}_{invar}$ and $\mathcal{L}_{equiv}$ for robust performance, and the method shows transferability to new domains, highlighting its practical potential for settings where multi-view data and 3D annotations are scarce.
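
As a rough illustration of how the composite objective above combines its terms, the following minimal sketch computes $\mathcal{L}_{total}$ from four scalar loss values using the stated weights. The names `LossWeights`, `total_loss`, and `mse` are hypothetical, and the use of mean squared error for the invariance term is an assumption made only for this example; the summary does not specify the exact form of the individual losses.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LossWeights:
    """Weights from the summary: alpha = 1.0, beta = 0.001, gamma = 1.0."""
    alpha: float = 1.0
    beta: float = 0.001
    gamma: float = 1.0


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors.

    ASSUMPTION: used here only as a stand-in distance; the actual form of
    each loss term is not given in the summary.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(l_invar: float, l_equiv: float, l_rec1: float, l_rec2: float,
               w: LossWeights = LossWeights()) -> float:
    """L_total = alpha * L_invar + beta * L_equiv + gamma * (L_rec1 + L_rec2)."""
    return w.alpha * l_invar + w.beta * l_equiv + w.gamma * (l_rec1 + l_rec2)


if __name__ == "__main__":
    # Toy flattened pose features from two simultaneous views (made-up numbers).
    pose_view_a = [0.1, 0.4, 0.9]
    pose_view_b = [0.1, 0.5, 0.7]
    l_invar = mse(pose_view_a, pose_view_b)  # assumed form of the invariance term
    # The remaining three terms are placeholder scalars for illustration.
    print(total_loss(l_invar, l_equiv=2.0, l_rec1=0.1, l_rec2=0.2))
```

With $\beta = 0.001$, the equivariance term contributes three orders of magnitude less per unit of loss than the invariance and reconstruction terms, which is why it is kept as an explicit weight rather than folded into the other terms.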

Abstract

Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract a view-invariant 3D human pose representation from a 2D image without using 3D joint data. Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames from different viewpoints and their equivariant properties between augmented frames from the same viewpoint. We evaluate the learned view-invariant pose representations on two downstream tasks. We perform comparative experiments that show improvements on the state-of-the-art unsupervised cross-view action classification accuracy on NTU RGB+D by a significant margin, on both RGB and depth images. We also show the efficiency of transferring the learned representations from NTU RGB+D to obtain the first-ever unsupervised cross-view and cross-subject rank correlation results on the multi-view human movement quality dataset, QMAR, and marginally improve on the state-of-the-art supervised results for this dataset. We also carry out ablation studies to examine the contributions of the different components of our proposed network.
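
For readers unfamiliar with the rank correlation metric mentioned above, the following minimal sketch shows one common way to compute Spearman's rank correlation between predicted and ground-truth quality scores: rank both lists (averaging tied ranks) and take the Pearson correlation of the ranks. The function names and the toy scores are hypothetical; this is not the authors' evaluation code.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(pred), average_ranks(truth))


if __name__ == "__main__":
    predicted = [0.2, 0.9, 0.5, 0.7]   # toy model scores (made up)
    annotated = [1.0, 4.0, 3.0, 2.0]   # toy ground-truth scores (made up)
    print(spearman(predicted, annotated))
```

A value of 1 means the predicted ordering matches the annotated ordering exactly, while values near 0 indicate little monotonic agreement.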

Paper Structure

This paper contains 5 sections, 6 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Left: the proposed network learns to disentangle canonical 3D human pose representations and view-dependent features through simultaneous frames from different views and augmented frames from the same view. Right: the unsupervised learned canonical pose representation can be used for downstream tasks.
  • Figure 2: The overall schema of the proposed view-invariant posture representation learning architecture.