Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision

David C. Jeong; Aditya Puranik; James Vong; Vrushabh Abhijit Deogirikar; Ryan Fell; Julianna Dietrich; Maria Kyrarini; Christopher Kitts

TL;DR

Fish2Mesh tackles 3D human mesh recovery from egocentric fisheye imagery, addressing distortion and occlusion by introducing a fisheye-aware transformer with Egocentric Position Embedding (EPE). The method leverages a Swin Transformer backbone and multi-task heads to regress SMPL parameters and camera transforms, guided by a loss that enforces 3D–2D consistency. A key contribution is the EPE, built on an equirectangular projection to embed discretized 3D coordinates, which together with dataset augmentation (including 4D-Human supervision and prompt-based collection) yields state-of-the-art results measured by $\text{PA-MPJPE}$ and $\text{PA-MPVPE}$ across Ego4D, ECHP, and expanded data. The work demonstrates robust performance under self-occlusion and partial-view conditions, with practical impact for XR, assistive robotics, and immersive media where accurate egocentric mesh reconstructions from distorted inputs are crucial.
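
To make the reported metric concrete, the following is a minimal sketch of PA-MPJPE (Procrustes-aligned mean per-joint position error) in its commonly used form: the prediction is aligned to the ground truth with the best similarity transform (scale, rotation, translation) before the per-joint error is averaged. The function name `pa_mpjpe` and the `(J, 3)` array layout are assumptions made for illustration; this is not the authors' evaluation code. PA-MPVPE is usually computed the same way over mesh vertices instead of joints.

```python
import numpy as np


def pa_mpjpe(pred, gt):
    """Mean per-joint error after similarity (Procrustes) alignment.

    pred, gt: arrays of shape (J, 3) holding predicted and ground-truth
    3D joint positions (hypothetical layout chosen for this sketch).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Remove translation by centering both point sets.
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_pred, gt - mu_gt

    # Optimal rotation from the SVD of the cross-covariance (Kabsch/Umeyama).
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Optimal isotropic scale for the centered prediction.
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()

    # Apply the similarity transform and average the per-joint distances.
    aligned = scale * p @ R.T + mu_gt
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

Because the alignment absorbs global scale, rotation, and translation, the metric isolates errors in articulated pose (or, for vertices, body shape and pose) from errors in global placement.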

Abstract

Egocentric human body estimation infers a user's body pose and shape from the first-person perspective of a wearable camera. Although prior research has used pose estimation techniques to overcome the self-occlusions and image distortions that arise with head-mounted fisheye cameras, similar advances in 3D human mesh recovery (HMR) have been limited. We introduce Fish2Mesh, a fisheye-aware transformer-based model designed for 3D egocentric human mesh recovery. We propose an egocentric position embedding block that generates an ego-specific position table for the Swin Transformer to reduce the effect of fisheye image distortion. Our model uses multi-task heads to regress SMPL parameters and camera translation, and it estimates 3D and 2D joints under auxiliary losses to support model training. To address the scarcity of egocentric camera data, we create a training dataset by employing the pre-trained 4D-Human model and third-person cameras for weak supervision. Our experiments demonstrate that Fish2Mesh outperforms previous state-of-the-art 3D HMR models.
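
One way to picture an ego-specific position table is sketched below, assuming the equirectangular mapping mentioned in the TL;DR: each grid cell is assigned a longitude and latitude, converted to a unit 3D direction, and quantized into integer bins that could index a learned embedding. The function name `equirect_position_table`, the axis convention, and the `n_bins` parameter are hypothetical choices for illustration; the source does not specify these details, so this should not be read as the authors' implementation.

```python
import numpy as np


def equirect_position_table(height, width, n_bins=32):
    """Map a (height, width) grid to unit 3D directions and integer bins.

    Assumes an equirectangular layout: columns span longitude (-pi, pi),
    rows span latitude (pi/2, -pi/2) from top to bottom. All names and the
    axis convention are assumptions made for this sketch.
    """
    # Normalized cell centers in (0, 1).
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height

    lon = (u - 0.5) * 2.0 * np.pi  # shape (width,)
    lat = (0.5 - v) * np.pi        # shape (height,)
    lon, lat = np.meshgrid(lon, lat)  # both become (height, width)

    # Unit direction on the sphere for every grid cell.
    xyz = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )

    # Discretize each coordinate from [-1, 1] into n_bins integer bins.
    bins = np.floor((xyz + 1.0) / 2.0 * n_bins).astype(np.int64)
    bins = np.clip(bins, 0, n_bins - 1)
    return xyz, bins
```

Under these assumptions, the integer table has shape `(height, width, 3)`, and each of its three channels could be used to look up a learnable embedding vector that is added to the corresponding patch token.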
