PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis

Zhilin Guo, Jing Yang, Kyle Fogarty, Jingyi Wan, Boqiao Zhang, Tianhao Wu, Weihao Xia, Chenliang Zhou, Sakar Khattar, Fangcheng Zhong, Cristina Nader Vasconcelos, Cengiz Oztireli

TL;DR

PoseCraft is a diffusion framework built around a tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, it encodes sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and injects them into the diffusion model via cross-attention.

Abstract

Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fitting, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around a tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into the diffusion model via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves a significant improvement in perceptual quality over diffusion-centric methods, and attains better or comparable metrics to the latest volumetric rendering state of the art (SOTA) while better preserving fabric and hair details.
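To make the idea of "discrete conditioning tokens" concrete, the following is a minimal sketch of one way 3D landmarks and camera extrinsics could be turned into integer tokens. It is not the authors' tokenizer: the uniform binning scheme, the vocabulary size `NUM_BINS`, the coordinate ranges, and all function names are assumptions made for illustration only.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of bins per scalar; not taken from the paper


def quantize(value: float, lo: float, hi: float, num_bins: int = NUM_BINS) -> int:
    """Map a scalar in [lo, hi] to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, lo), hi)          # clamp out-of-range values
    frac = (clipped - lo) / (hi - lo)          # normalize to [0, 1]
    return min(int(frac * num_bins), num_bins - 1)


def tokenize_landmarks(
    landmarks: Sequence[Sequence[float]], lo: float = -1.0, hi: float = 1.0
) -> List[int]:
    """Emit one token per coordinate of each (x, y, z) landmark.

    The [-1, 1] range assumes landmarks were normalized beforehand.
    """
    return [quantize(c, lo, hi) for point in landmarks for c in point]


def tokenize_extrinsics(
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    t_lo: float = -5.0,
    t_hi: float = 5.0,
) -> List[int]:
    """Emit 9 rotation tokens followed by 3 translation tokens.

    Rotation-matrix entries always lie in [-1, 1]; the translation
    range is an assumed scene bound.
    """
    rot_tokens = [quantize(r, -1.0, 1.0) for row in rotation for r in row]
    trans_tokens = [quantize(t, t_lo, t_hi) for t in translation]
    return rot_tokens + trans_tokens


def build_condition_tokens(
    landmarks: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> List[int]:
    """Concatenate camera tokens and landmark tokens into one sequence."""
    return tokenize_extrinsics(rotation, translation) + tokenize_landmarks(landmarks)
```

In a setup like the one the abstract describes, such token ids would typically index a learned embedding table, and the resulting embeddings would serve as keys and values in the denoiser's cross-attention layers. That step is omitted here because its details are not given in this summary.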

Paper Structure (21 sections, 11 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: PoseCraft uses tokenized 3D landmarks and camera parameters to synthesize photorealistic humans. This explicit 3D control delivers sharp silhouettes, preserves high-frequency details, and ensures structural coherence across novel viewpoints.
  • Figure 2: Pipeline overview of (A) RigCraft and (B, C) PoseCraft. (A) RigCraft performs multi-view fusion to generate temporally stable 3D landmarks from 2D poses. (B) During training, PoseCraft learns to denoise latent representations conditioned on tokenized 3D landmarks and camera parameters. (C) At inference, PoseCraft synthesizes a photorealistic image from a target 3D pose and camera view.
  • Figure 3: PoseCraft Architecture. We guide a latent diffusion UNet with two conditioning streams. First, a 3D Control Tokenizer converts camera extrinsics and 3D body landmarks into discrete tokens, which are injected via cross-attention for explicit 3D control. Second, a 2D skeleton projection is concatenated with the noisy latent input to provide direct spatial guidance.
  • Figure 4: Qualitative comparison on the GenHumanRF test split [isik2023humanrf]. PoseCraft substantially outperforms the image-based methods CFLD [lu2024cfld], CHAMP [zhu2024champ], T2I-Adapter [mou2024t2i], and ControlNet [zhang2023controlnet], and compares favorably or comparably to the volumetric rendering SOTA, Animatable Gaussians [li2024animatablegaussians].
  • Figure 5: Qualitative visualization of the RigCraft 3D landmark estimation and refinement. From top to bottom: (1) noisy 3D landmarks from re-projected 2D OpenPose detections; (2) landmarks after multi-view triangulation; and (3) the final, temporally coherent RigCraft outputs after smoothing.
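
The triangulation and smoothing stages named in the Figure 2 and Figure 5 captions can be illustrated with a small sketch. The captions do not say which algorithms RigCraft uses, so the code below substitutes two simple stand-ins: two-view midpoint triangulation of camera rays, and an exponential moving average over time. All function names and the `alpha` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(
    o1: Sequence[float], d1: Sequence[float],
    o2: Sequence[float], d2: Sequence[float],
) -> Vec3:
    """Return the midpoint of the shortest segment between two rays.

    Each ray starts at a camera center `o` and points along `d`, the
    back-projected direction of a 2D keypoint (assumed to be given).
    """
    w = tuple(p - q for p, q in zip(o1, o2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; cannot triangulate")
    s = (b * e - c * d) / denom   # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom   # parameter of the closest point on ray 2
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    return tuple(0.5 * (u + v) for u, v in zip(p1, p2))


def ema_smooth(track: Sequence[Sequence[float]], alpha: float = 0.5) -> List[Vec3]:
    """Smooth one landmark's per-frame positions with an exponential moving average.

    `alpha` weights the current frame; smaller values smooth more strongly.
    """
    smoothed: List[Vec3] = []
    for point in track:
        if not smoothed:
            smoothed.append(tuple(point))          # first frame passes through
        else:
            prev = smoothed[-1]
            smoothed.append(
                tuple(alpha * cur + (1.0 - alpha) * old for cur, old in zip(point, prev))
            )
    return smoothed
```

For example, two cameras at `(-1, 0, 0)` and `(1, 0, 0)` whose rays both pass through `(0, 0, 5)` triangulate to that point. With more than two views, a least-squares formulation over all rays would be the more usual choice than this pairwise midpoint.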