StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset

Chaofan Huo, Ye Shi, Yuexin Ma, Lan Xu, Jingyi Yu, Jingya Wang

TL;DR

This work tackles monocular 3D human-object reconstruction by introducing Human-Object Offset (HO-offset), a dense vector $\mathbf{x}\in\mathbb{R}^{3mn}$ formed from offsets $\mathbf{d}_{i,j}=\mathbf{p}_j^{\text{o}}-\mathbf{p}_i^{\text{h}}$ between densely sampled human and object anchors. It then learns a compact latent HO relation space via PCA and uses StackFLOW, a stacked sequence of conditional normalizing flows, to infer the posterior distribution of HO relations from a single image, followed by a post-optimization that enforces 2D-3D consistency and HO-offset coherence. Key contributions include the HO-offset representation, a two-flow posterior inference framework conditioned on image and pose, and a reprojection-plus-offset optimization objective that yields accurate and physically plausible reconstructions. Evaluated on BEHAVE and InterCap, the approach achieves competitive accuracy and substantial speedups, particularly under heavy occlusion, highlighting the practical impact for real-world HOI understanding from monocular imagery.

Abstract

Modeling and capturing the 3D spatial arrangement of the human and the object is key to perceiving 3D human-object interaction from monocular images. In this work, we propose to represent the human-object spatial relation with the Human-Object Offset between anchors that are densely sampled from the surfaces of the human mesh and the object mesh. Compared with previous works, which use a contact map or an implicit distance field to encode 3D human-object spatial relations, our method is a simple and efficient way to encode the highly detailed spatial correlation between the human and the object. Based on this representation, we propose Stacked Normalizing Flow (StackFLOW) to infer the posterior distribution of human-object spatial relations from the image. During the optimization stage, we fine-tune the human body pose and the object 6D pose by maximizing the likelihood of samples under this posterior distribution and minimizing the 2D-3D correspondence reprojection loss. Extensive experiments show that our method achieves impressive results on two challenging benchmarks, the BEHAVE and InterCap datasets.
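
To make the offset representation concrete, the following is a minimal NumPy sketch of how the pairwise offsets $\mathbf{d}_{i,j}=\mathbf{p}_j^{\text{o}}-\mathbf{p}_i^{\text{h}}$ could be computed and flattened into a vector of length $3mn$. The function name `ho_offset`, the toy anchor coordinates, and the flattening order (human index first, then object index) are assumptions made for illustration; the paper's own implementation may differ.

```python
import numpy as np

def ho_offset(human_anchors: np.ndarray, object_anchors: np.ndarray) -> np.ndarray:
    """Return all pairwise offsets d_ij = p_j^o - p_i^h as one flat vector.

    human_anchors:  (m, 3) array of anchor points on the human mesh.
    object_anchors: (n, 3) array of anchor points on the object mesh.
    The flattening order (i major, j minor) is an assumption of this sketch.
    """
    # Broadcasting yields an (m, n, 3) array whose [i, j] entry is p_j^o - p_i^h.
    offsets = object_anchors[None, :, :] - human_anchors[:, None, :]
    return offsets.reshape(-1)  # length 3 * m * n

# Toy anchors (hypothetical values): m = 2 human anchors, n = 3 object anchors.
human = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
obj = np.array([[0.0, 1.0, 0.0],
                [0.0, 0.0, 2.0],
                [1.0, 1.0, 1.0]])

x = ho_offset(human, obj)
print(x.shape)
```

Because every human anchor is paired with every object anchor, the vector grows as $3mn$, which is one reason a lower-dimensional latent space is useful downstream.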

Paper Structure (25 sections, 18 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: The Human-Object Offset $\mathbf{d}_{i,j}$ is the vector from the human anchor point $\mathbf{p}_i^\text{h}$ to the object anchor point $\mathbf{p}_j^\text{o}$; it encodes both the distance between the two anchors and the direction from one to the other. Offsets are calculated between two sets of anchors that are densely sampled beforehand from the surfaces of the human mesh and the object mesh. The dense offsets capture a highly detailed correlation between human parts and object parts, giving a quantitative representation of the 3D spatial relationship between the human and the object in a given human-object interaction instance.
  • Figure 2: Main framework of our method. (a) We use the human-object offset to encode the spatial relation between the human and the object. For a human-object pair, offsets are calculated and flattened into an offset vector $\mathbf{x}$. From all offset vectors calculated on the training set, the latent spatial relation space is constructed using principal component analysis. To obtain a vectorized representation of the human-object spatial relation, the offset vector is linearly projected into this latent spatial relation space. Conversely, given a sample $\gamma$ from this latent spatial relation space, we can reproject it to recover the offset vector $\hat{\mathbf{x}}$. The human-object instance can then be reconstructed from $\hat{\mathbf{x}}$ by iterative optimization. (b) With the pre-constructed latent spatial relation space, we use a stacked normalizing flow to infer the posterior distribution of the human-object spatial relation for an input image. The details are given in Sec. \ref{section:distribution}. (c) In the post-optimization stage, we further fine-tune the reconstruction results using the 2D-3D reprojection loss and the offset loss, as described in Sec. \ref{section:optimization}.
  • Figure 3: Visualized reconstruction results on the BEHAVE dataset. Red regions depict the contact region for BSTRO or the relative distance for our method. Red circles mark incorrect reconstructions. These results show that our method performs well in some cases of heavy occlusion.
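
The projection and reprojection steps described in the Figure 2 caption can be sketched as follows. This is a minimal illustration that builds a PCA basis with an SVD, assuming the latent code is $\gamma = B(\mathbf{x}-\boldsymbol{\mu})$ and the recovered offset is $\hat{\mathbf{x}} = \boldsymbol{\mu} + B^\top\gamma$. The function names, the use of mean-centering, the latent dimension `k`, and the synthetic training data are all assumptions of this sketch and are not taken from the paper.

```python
import numpy as np

def fit_latent_space(X: np.ndarray, k: int):
    """Fit a k-dimensional PCA basis to offset vectors X of shape (N, D)."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]  # mean: (D,), basis: (k, D)

def project(x: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Map an offset vector x to its latent code gamma by linear projection."""
    return basis @ (x - mean)

def reproject(gamma: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Recover an approximate offset vector x_hat from a latent code gamma."""
    return mean + basis.T @ gamma

# Synthetic "training set" (hypothetical): 50 offset vectors of dimension 18
# that lie in a 4-dimensional affine subspace, so k = 4 reconstructs them exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 18)) + 0.5

mean, basis = fit_latent_space(X, k=4)
gamma = project(X[0], mean, basis)
x_hat = reproject(gamma, mean, basis)
print(gamma.shape, float(np.abs(x_hat - X[0]).max()))
```

On real offset vectors the reconstruction is only approximate, and the choice of `k` trades compactness of the latent space against reconstruction error.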