SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis

Chuqiao Wu; Jin Song; Yiyun Fei

TL;DR

SkeleGuide is a framework built on explicit skeletal reasoning that significantly outperforms both specialized and general-purpose models in generating high-fidelity, context-aware human images.

Abstract

Inserting realistic, structurally plausible humans into existing scenes remains a significant challenge for current generative models, which often produce artifacts such as distorted limbs and unnatural poses. We attribute this systematic failure to an inability to reason explicitly over human skeletal structure. To address this, we introduce SkeleGuide, a framework built on explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal latent pose that acts as a strong structural prior, guiding synthesis toward high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit, editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, context-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step toward robust and plausible human image synthesis.

Paper Structure (28 sections, 6 equations, 5 figures, 3 tables)

Introduction
Related Work
Controllable Diffusion Models
The Challenge of Human Image Synthesis
From Disjointed Pipelines to Integrated Reasoning
Method
Preliminaries
Overall Framework
Unified Architecture and Condition Injection
Training and Inference Pipeline
Phase 1: Reasoning Module Pre-training.
Phase 2: Joint End-to-End Training.
Phase 3: Rendering Module Fine-tuning.
Inference Pipeline.
Fine-Grained Pose Editing via Latent Inversion
...and 13 more sections

Figures (5)

Figure 1: From just a scene image and a text prompt, SkeleGuide enables users to realistically place a person and precisely control the pose.
Figure 2: Overview of our SkeleGuide framework. Stage 1 (Skeletal Reasoning) generates a latent pose representation from text and a scene image. Stage 2 (Appearance Rendering) then synthesizes the final image conditioned on this latent pose. The optional control loop enables fine-grained editing of the intermediate pose.
Figure 3: General qualitative comparison with state-of-the-art methods. Across a variety of scenes, SkeleGuide demonstrates superior performance over specialized and general-purpose models.
Figure 4: Qualitative comparison with Person-in-Place. SkeleGuide yields more coherent and plausible skeletons and final images with fewer artifacts.
Figure 5: Joint training enforces structural coherence. Feedback from the rendering stage corrects the severe structural artifacts (e.g. incoherent limbs) produced when training Stage 1 alone.
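
The two-stage flow described in the Figure 2 caption can be sketched as a simple composition of stages. This is a minimal illustration, not the authors' implementation: the function name `synthesize` and the callables `reason`, `render`, and `edit` are hypothetical placeholders, and passing the scene and prompt to the renderer as well as the pose is an assumption (the caption states only that Stage 2 is conditioned on the latent pose).

```python
from typing import Callable, Optional, TypeVar

Scene = TypeVar("Scene")
Pose = TypeVar("Pose")
Image = TypeVar("Image")


def synthesize(
    scene: Scene,
    prompt: str,
    reason: Callable[[Scene, str], Pose],
    render: Callable[[Scene, str, Pose], Image],
    edit: Optional[Callable[[Pose], Pose]] = None,
) -> Image:
    """Compose the two stages; every callable here is a hypothetical stand-in."""
    # Stage 1 (Skeletal Reasoning): latent pose from the text and scene image.
    latent_pose = reason(scene, prompt)
    # Optional control loop: let the caller adjust the intermediate pose.
    if edit is not None:
        latent_pose = edit(latent_pose)
    # Stage 2 (Appearance Rendering): final image conditioned on the pose.
    # Assumption: the renderer also sees the scene and the prompt.
    return render(scene, prompt, latent_pose)
```

Keeping `edit` as an optional hook mirrors the caption's description of the control loop as optional: when it is omitted, the pose produced by Stage 1 goes to Stage 2 unchanged.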
