Patient4D: Temporally Consistent Patient Body Mesh Recovery from Monocular Operating Room Video

Mingxiao Tu; Hoijoon Jung; Alireza Moghadam; Andre Kyme; Jinman Kim

Abstract

Recovering a dense 3D body mesh from monocular video remains challenging under occlusion from draping and continuously moving camera viewpoints. This configuration arises in surgical augmented reality (AR), where an anesthetized patient lies under surgical draping while a surgeon's head-mounted camera continuously changes viewpoint. Existing human mesh recovery (HMR) methods are typically trained on upright, moving subjects captured from relatively stable cameras, and their performance degrades under such conditions. To address this, we present Patient4D, a reconstruction pipeline that explicitly exploits the prior that the patient remains stationary. The pipeline combines image-level foundation models for perception with lightweight geometric mechanisms that enforce temporal consistency across frames. Two key components enable robust reconstruction: Pose Locking, which anchors pose parameters using stable keyframes, and Rigid Fallback, which recovers meshes under severe occlusion through silhouette-guided rigid alignment. Together, these mechanisms stabilize predictions while remaining compatible with off-the-shelf HMR models. We evaluate Patient4D on 4,680 synthetic surgical sequences and three public HMR video benchmarks. Under surgical drape occlusion, Patient4D achieves a mean IoU of 0.75 and reduces failure frames from 30.5% to 1.3% relative to the best baseline. Our findings demonstrate that exploiting stationarity priors can substantially improve monocular reconstruction in clinical AR scenarios.

Paper Structure (20 sections, 5 equations, 7 figures, 7 tables)

Introduction
Related Work
Monocular Human Mesh Recovery
Vision Foundation Models for 3D Perception
Surgical AR and Patient Registration
Methodology
Patient Segmentation
Per-Frame Human Mesh Recovery
Pose Locking
Rigid Fallback
Application to Surgical AR
Experiments
Setup
Simulation Results and Analysis
Evaluation on Standard HMR Video Benchmarks
...and 5 more sections

Figures (7)

Figure 1: Overview of the Patient4D pipeline. Given a monocular surgical video and a single point prompt, SAM3 produces per-frame body masks while MoGe estimates camera intrinsics from the first frame. SAM3DBody then recovers initial per-frame SMPL meshes from the masked regions; body shape $\boldsymbol{\beta}$ and scale are locked to a single keyframe estimate and broadcast to all frames. To enforce temporal consistency, Pose Locking anchors body pose parameters to a median value calculated across the sequence. Frames whose mesh-mask IoU falls below a threshold ($\tau_\text{IoU}{=}0.6$) are corrected by Rigid Fallback, which selects the best reference mesh from a keyframe pool and optimizes a rigid transformation to align its projection with the current mask via a Dice objective.
Figure 2: Rigid Fallback on a challenging frame. Left: per-frame HMR produces a misaligned mesh (IoU$=$0.21) when the camera moves to an oblique viewpoint with partial visibility. Right: Rigid Fallback takes the best reference mesh from the keyframe pool and optimizes rotation and camera translation to align its projection with the SAM3 mask via the Dice objective, recovering correct alignment (IoU$=$0.93). Bottom row: mask overlap (green$=$projected mesh, blue$=$SAM3 mask, cyan$=$intersection).
Figure 3: Representative frames from the simulation datasets. Top: Sim-Geometry sequences rendered from SLP-3Dfits with ground truth SMPL meshes, generated patient texture and orbiting camera silhouettes. Bottom: Sim-Visual frames at three drape coverage levels (none, partial, heavy), depicting realistic surgical scenes with photorealistic skin textures and operating room environments. These two are complementary: Sim-Geometry shows clean silhouettes with ground truth SMPL meshes whereas Sim-Visual demonstrates increasing visual complexity from drapes.
Figure 4: Qualitative comparison of mesh recovery on a synthetic surgical sequence. The patient is positioned in a lateral decubitus pose under no draping. As the camera orbits from the initial viewpoint (Frames 10, 50) to extreme oblique angles (Frames 150, 200), the 2D visual cues become ambiguous. Standard per-frame and video-based baselines fail to recognize the stationarity prior of the patient; they attempt to fit the changing 2D contours by distorting the 3D anatomy (e.g., unnaturally twisting the spine or flattening the torso), resulting in plummeting Mesh-Mask IoU. In contrast, Patient4D (orange) enforces temporal consistency, preserving the correct 3D geometry and maintaining strong spatial alignment across the entire camera trajectory.
Figure 5: Per-frame Mesh-Mask IoU ($\uparrow$) over time for one simulation video. At approximately frame 260, the sequence introduces a severe camera viewpoint transition coupled with heavy drape occlusion. While all baseline methods suffer failure frames (IoU dropping below 0.2), Patient4D successfully triggers Rigid Fallback, retrieving a high-quality historical reference to immediately recover and maintain high spatial alignment (IoU $\approx$ 0.9).
...and 2 more figures
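
The captions above name three simple geometric quantities: the mesh-mask IoU that gates Rigid Fallback ($\tau_\text{IoU}{=}0.6$), the Dice objective used for silhouette alignment, and a median over the sequence used by Pose Locking. The following is a minimal sketch of those quantities, not the authors' implementation. All function names are hypothetical, masks are assumed to be nested lists of 0/1 values, and the median is assumed to be taken independently per pose parameter, which the captions do not specify. The Dice function here is the hard (non-differentiable) form; an optimizer would presumably use a soft variant.

```python
from statistics import median

TAU_IOU = 0.6  # threshold quoted in the Figure 1 caption


def _intersection(a, b):
    """Count pixels set in both binary masks (nested lists of 0/1)."""
    return sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def mask_iou(a, b):
    """Intersection over union of two binary masks of equal size."""
    inter = _intersection(a, b)
    union = sum(map(sum, a)) + sum(map(sum, b)) - inter
    return inter / union if union else 1.0


def dice_loss(a, b):
    """1 - Dice coefficient; 0.0 when the two masks coincide."""
    total = sum(map(sum, a)) + sum(map(sum, b))
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * _intersection(a, b) / total


def lock_pose(per_frame_pose):
    """Assumed form of Pose Locking: per-parameter median over all frames,
    broadcast back to every frame."""
    locked = [median(column) for column in zip(*per_frame_pose)]
    return [list(locked) for _ in per_frame_pose]


def frames_needing_fallback(ious, tau=TAU_IOU):
    """Indices of frames whose mesh-mask IoU falls below the threshold."""
    return [i for i, value in enumerate(ious) if value < tau]
```

For example, `frames_needing_fallback([0.9, 0.21, 0.6, 0.59])` flags frames 1 and 3, since only values strictly below the threshold are selected. The median makes the locked pose insensitive to a minority of badly estimated frames, which is one plausible reason to prefer it over a mean for a stationary subject.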
