Controllable Human-centric Keyframe Interpolation with Generative Prior

Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy

TL;DR

This work tackles controllable, high-fidelity human-centric keyframe interpolation across large temporal gaps by introducing PoseFuse3D-KI, a diffusion-based framework that integrates 3D guidance from SMPL-X with 2D pose signals. Its 3D-informed control model, PoseFuse3D, encodes 3D geometry through a dedicated SMPL-X encoder and fuses it with 2D cues via a two-stage attention-based fusion. The resulting control signal conditions a pre-trained video diffusion model (Wan2.1) using cross-normalization and LoRA tuning. The authors evaluate on CHKI-Video, a new dataset with 2D and 3D annotations, and report state-of-the-art performance on both whole-frame and human-centric metrics, including robustness to long temporal gaps and in-the-wild scenarios. Ablations show the importance of 3D cues, joint/vertex aggregation, and the fusion design. The authors acknowledge limitations in SMPL-X accuracy, compute demands, and interactions with external objects.

Abstract

Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.
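
For readers unfamiliar with the whole-frame metric quoted in the abstract, the following is a minimal sketch of how PSNR is conventionally computed from the mean squared error between a reference frame and a generated frame. It assumes 8-bit pixel values flattened into plain Python sequences; the function names `psnr` and `relative_change` are illustrative and do not come from the paper, and the paper's exact evaluation protocol may differ.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change(baseline, ours):
    """Relative change of `ours` with respect to `baseline`, as a fraction."""
    return (ours - baseline) / baseline
```

Higher PSNR is better, whereas LPIPS is a learned perceptual distance where lower is better, which is why the abstract reports an improvement for one and a reduction for the other.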

Paper Structure

This paper contains 26 sections, 4 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Keyframe Interpolation with Different Strategies. (a) Interpolation using I2V models without intermediate guidance often yields implausible or distorted frames, especially under large motion or occlusion. (b) Skeleton-guided interpolation offers structural cues but lacks geometric detail, resulting in unrealistic body shape and appearance. (c) Our PoseFuse3D-KI employs dense human-centric guidance, enabling temporally coherent and visually plausible interpolations.
  • Figure 2: Model Architecture. Our PoseFuse3D-KI framework, as shown in (a), comprises a video diffusion model (VDM) and a novel control model, PoseFuse3D. The PoseFuse3D model extracts rich features from both 3D and 2D control signals and fuses them into a unified representation to guide the VDM. The key component of PoseFuse3D is the SMPL-X encoder as illustrated in (b), which provides explicit 3D signal features. Specifically, the SMPL-X encoder first extracts 3D information from the SMPL-X model with 2D correspondences via projection. The 3D and 2D information is then encoded in parallel. With features of 2D correspondences, 3D information is aggregated onto the 2D image plane using attention mechanisms. The aggregated features are subsequently processed to produce the final feature $S^{3D}$.
  • Figure 3: Qualitative Results of Different 3D Control Strategies. We use red circles to highlight regions where the 3D controls and our strategy significantly improve the interpolation quality.
  • Figure 4: Qualitative Comparisons with State-of-the-Art Methods.
  • Figure 5: Qualitative Results of In-the-wild Control and Keyframe Interpolation.
  • ...and 3 more figures
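
The Figure 2 caption describes two steps inside the SMPL-X encoder: projecting 3D information to 2D correspondences, then aggregating 3D information onto the image plane with attention. The sketch below illustrates those two generic operations only. It assumes a simple pinhole camera and single-query scaled dot-product attention; the function names, the camera parameters (`focal`, `cx`, `cy`), and the way queries, keys, and values are formed are all hypothetical, since the caption does not specify them.

```python
import math


def project_points(points_3d, focal, cx, cy):
    """Pinhole projection of camera-space (x, y, z) points to pixel (u, v)."""
    projected = []
    for x, y, z in points_3d:
        if z <= 0:
            raise ValueError("points must lie in front of the camera (z > 0)")
        projected.append((focal * x / z + cx, focal * y / z + cy))
    return projected


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over N key/value pairs."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output entry per feature dimension.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

In this illustration, a query would stand for the feature at one image-plane location, and the keys and values would stand for features attached to projected joints or vertices, so the output is a per-location blend of 3D features.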