FOF-X: Towards Real-time Detailed Human Reconstruction from a Single Image

Qiao Feng, Yuanwang Yang, Yebin Liu, Yu-Kun Lai, Jingyu Yang, Kun Li

TL;DR

This work tackles real-time monocular 3D human reconstruction by introducing the Fourier Occupancy Field (FOF), which represents the 3D occupancy $F: [-1,1]^3 \to \{0,0.5,1\}$ as a 2D coefficient field by expanding it in a truncated series basis along the $z$-axis, so that standard 2D CNNs can process it efficiently. To improve robustness and avoid Gibbs artifacts, FOF-X adopts a cosine-series formulation, leverages dual-sided normal maps and an SMPL prior, and makes the inter-conversion between FOF and meshes robust with an automaton-based discontinuity matcher and a Laplacian coordinate constraint. The pipeline runs in real time (over 30 FPS, e.g., about $0.02$ s per frame on capable GPUs) and reports state-of-the-art accuracy on THuman2.1, CAPE, and CustomHumans, while remaining compatible with traditional mesh pipelines. The framework bridges 2D image processing and 3D geometry, offering a scalable, cross-domain representation that supports robust, high-fidelity reconstruction from a single image. Future work includes extending the approach to perspective-camera setups, handling very thin structures, and exploring broader scene-level representations.
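To make the cosine-series idea concrete, the following is a minimal sketch for a single pixel, not the authors' implementation. It assumes the occupancy along one ray is 1 inside a few intervals and 0 elsewhere, maps $z \in [-1,1]$ to $t \in [0,1]$, and uses the standard half-range cosine expansion. The function names, the interval endpoints, and the choice of 32 terms are all hypothetical; the paper's normalization and number of terms may differ.

```python
import numpy as np

def cosine_coeffs(intervals, num_terms):
    """Half-range cosine coefficients of a 0/1 occupancy along z in [-1, 1].

    intervals: (z_in, z_out) pairs where the ray is inside the body
    (hypothetical input; in practice these come from ray/mesh crossings).
    """
    n = np.arange(1, num_terms)
    c = np.zeros(num_terms)
    for z_in, z_out in intervals:
        t0, t1 = (z_in + 1) / 2, (z_out + 1) / 2  # map [-1, 1] -> [0, 1]
        c[0] += t1 - t0                           # mean occupancy along the ray
        # closed-form integral of 2*cos(n*pi*t) over [t0, t1]
        c[1:] += 2 * (np.sin(n * np.pi * t1) - np.sin(n * np.pi * t0)) / (n * np.pi)
    return c

def reconstruct(c, z):
    """Evaluate the truncated cosine series at depths z in [-1, 1]."""
    t = (np.asarray(z) + 1) / 2
    n = np.arange(len(c))
    return np.cos(np.pi * np.outer(t, n)) @ c

# Two occupied segments along one ray (made-up values).
intervals = [(-0.5, 0.1), (0.4, 0.7)]
c = cosine_coeffs(intervals, num_terms=32)

# Thresholding at 0.5 recovers an inside/outside decision per depth sample.
z = np.linspace(-1, 1, 401)
inside = reconstruct(c, z) > 0.5
```

The threshold of 0.5 matches the surface value in the occupancy definition: a Fourier-type series converges to the midpoint of a jump, so the 0.5 level set sits near each discontinuity.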

Abstract

We introduce FOF-X for real-time reconstruction of detailed human geometry from a single image. Balancing real-time speed against high-quality results is a persistent challenge, mainly due to the high computational demands of existing 3D representations. To address this, we propose the Fourier Occupancy Field (FOF), an efficient 3D representation learned as a Fourier series. The core of FOF is to factorize a 3D occupancy field into a 2D vector field, retaining topology and spatial relationships within the 3D domain while remaining compatible with 2D convolutional neural networks. Such a representation bridges the gap between the 3D and 2D domains, enabling the integration of human parametric models as priors and enhancing reconstruction robustness. Based on FOF, we design a new reconstruction framework, FOF-X, to avoid the performance degradation caused by texture and lighting. This enables our real-time reconstruction system to better handle the domain gap between training images and real images. Additionally, in FOF-X, we enhance the inter-conversion algorithms between FOF and mesh representations with a Laplacian constraint and an automaton-based discontinuity matcher, improving both quality and robustness. We validate the strengths of our approach on different datasets and real-captured data, where FOF-X achieves new state-of-the-art results. The code is available for research purposes at https://cic.tju.edu.cn/faculty/likun/projects/FOFX/index.html.
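The factorization of a 3D occupancy field into a 2D vector field can be illustrated with a small sketch. It assumes a voxelized occupancy grid and a discrete cosine basis sampled at voxel centres; these are illustrative choices, and in the paper the coefficients are predicted by a network rather than computed from voxels. The helper names, the 32-voxel grid, the synthetic sphere, and the 16 retained terms are hypothetical.

```python
import numpy as np

def cosine_basis(depth, num_terms):
    """(depth, num_terms) matrix of cos(n*pi*t) at voxel centres t in (0, 1)."""
    t = (np.arange(depth) + 0.5) / depth
    n = np.arange(num_terms)
    return np.cos(np.pi * np.outer(t, n))

def encode(occ, num_terms):
    """(H, W, D) occupancy grid -> (H, W, num_terms) per-pixel coefficient map."""
    depth = occ.shape[-1]
    basis = cosine_basis(depth, num_terms)
    scale = np.full(num_terms, 2.0 / depth)
    scale[0] = 1.0 / depth  # the constant term is normalized differently
    return (occ @ basis) * scale

def decode(coeffs, depth):
    """(H, W, num_terms) coefficient map -> (H, W, depth) reconstructed field."""
    return coeffs @ cosine_basis(depth, coeffs.shape[-1]).T

# Synthetic test shape: a sphere of radius 0.6 inside [-1, 1]^3.
D = 32
ax = (np.arange(D) + 0.5) / D * 2 - 1
y, x, z = np.meshgrid(ax, ax, ax, indexing="ij")
occ = (x**2 + y**2 + z**2 < 0.6**2).astype(float)

fof = encode(occ, num_terms=16)   # image-like tensor: one vector per pixel
recon = decode(fof, D) > 0.5      # threshold back to inside/outside
agreement = (recon == (occ > 0.5)).mean()
```

The coefficient map `fof` has the layout of a multi-channel image (height, width, channels), which is what lets a 2D image-to-image network consume or produce it directly.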

Paper Structure

This paper contains 29 sections, 9 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Left: Our FOF-X can reconstruct 3D human shapes from a live video stream with a real-time speed of over 30 FPS. Right: Compared to the original FOF-SMPL, our FOF-X demonstrates better robustness to variations in texture and lighting. Under challenging lighting conditions, such as strong illumination or shadows, FOF-X produces more detailed and accurate reconstructions (first two rows). FOF-X effectively avoids the incorrect reconstruction caused by textures, such as the stripe pattern on the edge of the shirt (third row). Note that FOF-SMPL is not real-time.
  • Figure 2: FOF and meshes can be inter-converted flexibly. The newly designed inter-conversion algorithms in FOF-X exhibit better robustness and quality. Our automaton-based discontinuity matcher eliminates floating artifacts during the conversion process (first row). With the Laplacian coordinate constraint, we resolve stair-step artifacts on the recovered meshes (second row).
  • Figure 3: The overall pipeline of FOF-X for monocular real-time human reconstruction. FOF-X takes an RGB image as input and exploits an SMPL body mesh as a prior with the proposed mesh-to-FOF conversion algorithm (Sec. \ref{convertion1}), which includes an automaton-based discontinuity matcher to ensure robustness. Based on the rendered SMPL normal maps and the input RGB image, the dual-sided normal maps are predicted as the internal representation and decoded to FOF with the SMPL prior through an image-to-image network (Sec. \ref{learn}). The FOF representation (Sec. \ref{formulation}) is then converted to a mesh with the FOF-to-mesh inversion module (Sec. \ref{convertion2}), incorporating a Laplacian coordinate constraint to enhance the quality of the output mesh.
  • Figure 4: Automaton-based discontinuity matcher used to process the discontinuities at each pixel. Results with and without the automaton-based discontinuity matcher are shown in Fig. \ref{imp}.
  • Figure 5: Example of Discontinuity Point Matching via Automaton.
  • ...and 4 more figures