Social EgoMesh Estimation

Luca Scofano, Alessio Sampieri, Edoardo De Matteis, Indro Spinelli, Fabio Galasso

TL;DR

This in-depth study sheds light on when social interaction matters most for ego-mesh estimation; it quantifies the impact of interpersonal distance and gaze direction and surpasses the current best technique, reducing the pose estimation error (MPJPE) by 53%.

Abstract

Accurately estimating the 3D pose of the camera wearer in egocentric video sequences is crucial to modeling human behavior in virtual and augmented reality applications. The task presents unique challenges due to the limited visibility of the user's body caused by the front-facing camera mounted on their head. Recent research has explored the utilization of the scene and ego-motion, but it has overlooked humans' interactive nature. We propose a novel framework for Social Egocentric Estimation of body MEshes (SEE-ME). Our approach is the first to estimate the wearer's mesh using only a latent probabilistic diffusion model, which we condition on the scene and, for the first time, on the social wearer-interactee interactions. Our in-depth study sheds light on when social interaction matters most for ego-mesh estimation; it quantifies the impact of interpersonal distance and gaze direction. Overall, SEE-ME surpasses the current best technique, reducing the pose estimation error (MPJPE) by 53%. The code is available at https://github.com/L-Scofano/SEEME.
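For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is commonly computed as the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below is a minimal illustration of that common definition, not the authors' evaluation code. The function names `mpjpe` and `relative_reduction` and all numeric values are hypothetical, and any alignment step used in the paper's protocol is omitted.

```python
import math


def mpjpe(pred, gt):
    """Average Euclidean distance between matching 3D joints.

    `pred` and `gt` are equal-length sequences of (x, y, z) tuples.
    Assumption: no root or Procrustes alignment is applied here.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    total = 0.0
    for p, g in zip(pred, gt):
        total += math.dist(p, g)  # per-joint Euclidean distance
    return total / len(pred)


def relative_reduction(baseline_error, new_error):
    """Fraction by which `new_error` improves on `baseline_error`."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical two-joint example: per-joint errors are 3 and 4, so the mean is 3.5.
gt_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_joints = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(pred_joints, gt_joints)

# Hypothetical errors illustrating what a 53% relative reduction means.
reduction = relative_reduction(100.0, 47.0)
```

Under this reading, a 53% reduction means the new error is 47% of the baseline error, whatever the absolute values are.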

Paper Structure

This paper contains 24 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (Left) Frame from the input egocentric video stream. We experience the immersive subjective perspective of the front-facing camera wearer, but the wearer is behind the wearable and, therefore, invisible. Still, we recognize the points of interest of the wearer, parts of the scene where the action happens, and, most importantly, the interactee engaged in communication with the wearer. (Right) Third-view reconstruction of the ego mesh of the camera wearer by our proposed SEE-ME. Our vantage point reveals the surrounding environment, featuring a sofa and a person from an overhead perspective, leading us to infer the wearer's likely standing position. Specifically, vicinity and gaze interactions are important cues for our reconstruction as we experimentally quantify.
  • Figure 2: SEE-ME framework. On the left, we present the VAE's training, which learns a meaningful latent space by solving reconstruction tasks. On the right, we extract and process our conditioning signals, which correspond to the 3D point cloud representation of the scene and the interactee's pose extracted from the video sequence. After the conditional denoising process, we output an SMPL representation of the wearer's pose.
  • Figure 3: Front view of 3 frames extracted from an egocentric sequence. We compare SEE-ME (blue) with EgoEgo (yellow) and the ground truth (green). The interactee's poses, shown in red, are extracted from the egocentric video sequence next to it, but not used in our model.
  • Figure 4: The interactee (red) influences the wearer's motion (blue).