JOintGS: Joint Optimization of Cameras, Bodies and 3D Gaussians for In-the-Wild Monocular Reconstruction

Zihan Lou, Jinlong Fan, Sihan Ma, Yuxiang Yang, Jing Zhang

TL;DR

JOintGS addresses the challenge of reconstructing high-fidelity animatable 3D human avatars from monocular in-the-wild videos by jointly optimizing camera extrinsics, SMPL poses, and 3D Gaussian fields. It introduces foreground-background disentanglement and a synergistic refinement mechanism in which static background Gaussians anchor camera estimates, refined cameras improve temporal pose correspondences, and accurate poses improve foreground-background separation. It adds a temporal dynamics module for per-frame non-rigid deformations and a residual color field for illumination effects, enabling robust, temporally consistent reconstructions. Experiments on NeuMan and EMDB show state-of-the-art PSNR and robustness to noisy initialization, with real-time rendering.

Abstract

Reconstructing high-fidelity animatable 3D human avatars from monocular RGB videos remains challenging, particularly in unconstrained in-the-wild scenarios where camera parameters and human poses from off-the-shelf methods (e.g., COLMAP, HMR2.0) are often inaccurate. Although 3D Gaussian Splatting (3DGS) methods deliver impressive rendering quality and real-time performance, they critically depend on precise camera calibration and pose annotations, limiting their applicability in real-world settings. We present JOintGS, a unified framework that jointly optimizes camera extrinsics, human poses, and 3D Gaussian representations from coarse initialization through a synergistic refinement mechanism. Our key insight is that explicit foreground-background disentanglement enables mutual reinforcement: static background Gaussians anchor camera estimation via multi-view consistency; refined cameras improve human body alignment through accurate temporal correspondence; optimized human poses enhance scene reconstruction by removing dynamic artifacts from static constraints. We further introduce a temporal dynamics module to capture fine-grained pose-dependent deformations and a residual color field to model illumination variations. Extensive experiments on the NeuMan and EMDB datasets demonstrate that JOintGS achieves superior reconstruction quality, with a 2.1 dB PSNR improvement over state-of-the-art methods on the NeuMan dataset, while maintaining real-time rendering. Notably, our method shows significantly enhanced robustness to noisy initialization compared to the baseline. Our source code is available at https://github.com/MiliLab/JOintGS.
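
To make the idea of refining coarse camera extrinsics concrete, here is a minimal sketch of applying a small correction to an initial pose $[\boldsymbol{R}|\mathbf{t}]$. It is an illustration, not the authors' implementation: the axis-angle parameterization of the correction, the left-composition order, and the names `rodrigues` and `apply_pose_correction` are all assumptions. In a real pipeline the correction would be a learnable parameter updated by gradients from the rendering loss.

```python
import math

def rodrigues(rotvec):
    """Convert an axis-angle vector (3 floats) to a 3x3 rotation matrix."""
    x, y, z = rotvec
    theta = math.sqrt(x * x + y * y + z * z)
    if theta < 1e-12:
        # No rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = x / theta, y / theta, z / theta
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def matmul(A, B):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def apply_pose_correction(R, t, delta_rotvec, delta_t):
    """Left-compose a correction (assumed axis-angle + translation) with [R|t]."""
    dR = rodrigues(delta_rotvec)
    R_new = matmul(dR, R)
    t_new = [a + b for a, b in zip(matvec(dR, t), delta_t)]
    return R_new, t_new

# Hypothetical coarse pose and a zero correction, which must leave it unchanged.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_same, t_same = apply_pose_correction(identity, [0.1, -0.2, 2.0],
                                       [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# A 90-degree correction about the z-axis rotates the translation as well.
R_rot, t_rot = apply_pose_correction(identity, [1.0, 0.0, 0.0],
                                     [0.0, 0.0, math.pi / 2], [0.0, 0.0, 0.5])
```

Parameterizing the update as a small residual on top of the coarse estimate keeps the optimization anchored near the initialization, which is one common way to refine noisy poses.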

Paper Structure (24 sections, 15 equations, 6 figures, 2 tables)

This paper contains 24 sections, 15 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison with Previous Methods. Unlike existing approaches that assume fixed camera poses and SMPL parameters as inputs, our JOintGS performs unified joint optimization through a synergistic refinement mechanism.
  • Figure 2: JOintGS Framework Overview. Given a monocular RGB video with coarse camera poses $\boldsymbol{T}=[\boldsymbol{R}|\mathbf{t}]$ from COLMAP and initial SMPL parameters $\boldsymbol{\xi}=(\boldsymbol{\beta},\boldsymbol{\theta})$ from HMR2.0, we initialize scene Gaussians $\mathcal{G}_B$ (from COLMAP point cloud) and human Gaussians $\mathcal{G}_H$ (from SMPL vertices) in canonical space. Our synergistic refinement mechanism (highlighted by orange gradient flow) jointly optimizes camera pose corrections $\Delta\boldsymbol{T}$, SMPL parameter refinements $\Delta\boldsymbol{\xi}$, and Gaussian attributes $\{\mathcal{G}_H, \mathcal{G}_B\}$ through unified differentiable rendering supervision. The optimization operates through three complementary pathways: (1) Background-anchored camera refinement: static scene Gaussians provide multi-view geometric constraints via photometric loss $\mathcal{L}_B$ on background regions; (2) Camera-guided human optimization: refined cameras enable accurate temporal correspondence for SMPL parameter optimization via human rendering loss $\mathcal{L}_H$; (3) Pose-aware Gaussian optimization: improved camera and SMPL parameters enhance foreground-background disentanglement, facilitating Gaussian field optimization with photometric losses $\mathcal{L}_{\text{render}}$. This closed-loop mutual refinement enables robust reconstruction from noisy initialization without requiring pre-calibrated inputs.
  • Figure S1: Ablation study on the phased optimization schedule. From left to right: the Ground Truth reference, results of our full three-stage optimization schedule, and results of performing joint optimization only (without warm-up and independent stages).
  • Figure S2: JOintGS Model Architecture. Our model architecture is composed of one Encoder module and two Decoder modules. In the Encoder module, the position attributes and the global temporal attributes of the Gaussian functions are encoded into positional features and temporal features, respectively. The Decoder module receives the positional features as input and utilizes a two-layer MLP with GELU activation to output either appearance or geometry features. These features are then fed into corresponding prediction heads to derive the specific attribute values. For certain dynamic attributes, we opt to inject the temporal features into the second layer of the MLP and use the same prediction head to output the corresponding residual values.
  • Figure S3: Qualitative comparison on NeuMan dataset. For each scene, we present the complete rendered image (first column) and a zoomed-in view of a densely textured region (second column), along with the error map of different methods (last three columns).
  • ...and 1 more figure
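
The decoder described in the Figure S2 caption (a two-layer MLP with GELU, a prediction head, and temporal features injected into the second layer to produce residuals) can be sketched as follows. This is a rough illustration under stated assumptions, not the released architecture: the layer widths, the random initialization, injection by element-wise addition, and the names `Decoder`, `linear`, and `init_linear` are all hypothetical.

```python
import math
import random

def gelu(x):
    """Exact GELU activation using the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(W, b, x):
    """Affine map W @ x + b with W as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_linear(rng, n_out, n_in):
    """Small random weights and zero bias (illustrative initialization)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

class Decoder:
    """Two-layer GELU MLP followed by one shared prediction head."""

    def __init__(self, feat_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.l1 = init_linear(rng, hidden_dim, feat_dim)
        self.l2 = init_linear(rng, hidden_dim, hidden_dim)
        self.head = init_linear(rng, out_dim, hidden_dim)

    def forward(self, pos_feat, time_feat=None):
        h = [gelu(v) for v in linear(*self.l1, pos_feat)]
        if time_feat is not None:
            # Assumption: "injection" is element-wise addition of the temporal
            # feature to the input of the second layer.
            h = [a + b for a, b in zip(h, time_feat)]
        h = [gelu(v) for v in linear(*self.l2, h)]
        return linear(*self.head, h)

# Hypothetical sizes: 4-d positional feature, 8-d hidden and temporal feature,
# 3-d output attribute (for example, a color).
decoder = Decoder(feat_dim=4, hidden_dim=8, out_dim=3)
pos_feat = [0.2, -0.1, 0.4, 0.0]
time_feat = [0.05] * 8

base = decoder.forward(pos_feat)                 # static attribute value
residual = decoder.forward(pos_feat, time_feat)  # time-dependent residual
dynamic_attribute = [b + r for b, r in zip(base, residual)]
```

Reusing one head for both the static value and the residual keeps the two outputs in the same attribute space, so they can be summed directly; how the paper combines them is not stated in the caption, and the sum here is an assumption.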