Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision

David C. Jeong, Aditya Puranik, James Vong, Vrushabh Abhijit Deogirikar, Ryan Fell, Julianna Dietrich, Maria Kyrarini, Christopher Kitts

TL;DR

Fish2Mesh tackles 3D human mesh recovery from egocentric fisheye imagery, addressing distortion and occlusion by introducing a fisheye-aware transformer with Egocentric Position Embedding (EPE). The method leverages a Swin Transformer backbone and multi-task heads to regress SMPL parameters and camera transforms, guided by a loss that enforces 3D–2D consistency. A key contribution is the EPE, built on an equirectangular projection to embed discretized 3D coordinates, which together with dataset augmentation (including 4D-Human supervision and prompt-based collection) yields state-of-the-art results measured by $PA\text{-}MPJPE$ and $PA\text{-}MPVPE$ across Ego4D, ECHP, and expanded data. The work demonstrates robust performance under self-occlusion and partial-view conditions, with practical impact for XR, assistive robotics, and immersive media where accurate egocentric mesh reconstructions from distorted inputs are crucial.
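
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how PA-MPJPE is commonly computed: the prediction is first aligned to the ground truth with a similarity transform (Procrustes alignment over scale, rotation, and translation), and the mean per-joint Euclidean error is then taken. The function names `procrustes_align` and `pa_mpjpe` are illustrative assumptions, not taken from the paper's code; PA-MPVPE follows the same recipe applied to mesh vertices instead of joints.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Similarity-align pred (N, 3) to gt (N, 3); hypothetical helper name."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g            # center both point sets
    H = p.T @ g                              # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                       # optimal rotation
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g


def mpjpe(pred, gt):
    """Mean per-joint position error, without any alignment."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())


def pa_mpjpe(pred, gt):
    """Mean per-joint position error after Procrustes alignment."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE measures the accuracy of the articulated pose itself rather than its placement in the camera frame.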

Abstract

Egocentric human body estimation allows for the inference of user body pose and shape from a wearable camera's first-person perspective. Although research has used pose estimation techniques to overcome self-occlusions and image distortions caused by head-mounted fisheye images, similar advances in 3D human mesh recovery (HMR) techniques have been limited. We introduce Fish2Mesh, a fisheye-aware transformer-based model designed for 3D egocentric human mesh recovery. We propose an egocentric position embedding block to generate an ego-specific position table for the Swin Transformer to reduce fisheye image distortion. Our model utilizes multi-task heads for SMPL parametric regression and camera translations, estimating 3D and 2D joints as auxiliary loss to support model training. To address the scarcity of egocentric camera data, we create a training dataset by employing the pre-trained 4D-Human model and third-person cameras for weak supervision. Our experiments demonstrate that Fish2Mesh outperforms previous state-of-the-art 3D HMR models.

Paper Structure

This paper contains 25 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The Fish2Mesh pipeline enables accurate 3D mesh recovery from egocentric fisheye perspectives. From left to right: (1) Fisheye Input with a wide field of view; (2) Third-person view (shown for context, not used as input); (3) Predicted (blue) vs. Ground Truth (red) vertices demonstrating near-complete overlap; and (4) Reconstructed 3D human mesh model from the fisheye input.
  • Figure 2: The architecture of the Fish2Mesh transformer model. The SMPL parameters $\Theta_{s}, \Theta_{p}$ are calculated to recover the human mesh, where $\Theta_{s}$ and $\Theta_{p}$ refer to the shape parameters and pose parameters respectively.
  • Figure 3: Equirectangular projection from spherical image.
  • Figure 4: Visual results of four examples from the four datasets, showing ground truth (red) and related models (blue). FisheyeViT is a pose estimation model, so we visualize the skeleton to compare the resulting joints. The third-person view is not used as model input and is provided purely as an environmental reference.
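
The equirectangular projection in Figure 3, which the TL;DR describes as the basis for embedding discretized 3D coordinates, can be illustrated with a small sketch. It maps a 3D viewing direction to longitude and latitude, then to continuous equirectangular pixel coordinates, and finally to a discrete grid cell. The axis convention (+z forward, +y up), the grid size, and the function names are assumptions made for illustration; the paper's actual EPE construction may differ.

```python
import math


def direction_to_equirect(x, y, z, width, height):
    """Map a 3D direction to continuous equirectangular coordinates (u, v).

    Assumed convention: +z is forward (image center), +y is up.
    """
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    lon = math.atan2(x, z)                        # longitude in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y)))       # latitude in [-pi/2, pi/2]
    u = (lon / (2.0 * math.pi) + 0.5) * width     # left to right
    v = (0.5 - lat / math.pi) * height            # top to bottom
    return u, v


def equirect_bin(u, v, width, height):
    """Discretize (u, v) into an integer (row, col) cell of a position table."""
    col = min(max(int(u), 0), width - 1)
    row = min(max(int(v), 0), height - 1)
    return row, col


# Usage: the forward direction lands in the center of a 64x32 grid.
u, v = direction_to_equirect(0.0, 0.0, 1.0, 64, 32)
row, col = equirect_bin(u, v, 64, 32)
```

A discrete (row, col) index of this kind could then select an entry in a learned position table, which is one plausible reading of "ego-specific position table" in the abstract.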