Pippo: High-Resolution Multi-View Humans from a Single Image

Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov

TL;DR

Pippo tackles the challenge of generating dense, high-resolution multi-view human imagery from a single image without relying on ground-truth camera parameters or explicit 3D priors. It introduces a three-stage training pipeline and a diffusion-transformer architecture with Spatial Anchor and Plücker-based conditioning, plus an attention-biasing mechanism to scale the number of views at inference. A novel 3D-consistency metric RE@SG is proposed to evaluate geometric fidelity without paired ground truth, and experiments demonstrate state-of-the-art results at 1K resolution across studio and casual iPhone inputs. This work advances scalable, photorealistic multi-view human synthesis with practical impact for entertainment, fashion, and AR applications.

Abstract

We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually taken photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs (e.g., a fitted parametric model or camera parameters of the input image). We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio-captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high resolution and use pixel-aligned controls (e.g., Spatial Anchor and Plücker rays) to enable 3D-consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate more than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate the 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
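The abstract mentions Plücker rays as a pixel-aligned camera control. As a rough illustration of what such a control can look like, the sketch below computes a 6-D Plücker coordinate (unit direction, moment = origin × direction) for every pixel of a pinhole camera. The function names, the +z viewing convention, and the half-pixel offset are assumptions made for this example; the excerpt does not specify how Pippo builds or consumes these maps.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (d, o x d) of a ray.

    The moment o x d is the same for any point o on the ray, so the
    encoding describes the line itself rather than a chosen origin.
    """
    d = normalize(direction)
    return d + cross(origin, d)


def pixel_ray_direction(u, v, fx, fy, cx, cy, cam_to_world_rot):
    """World-space direction through pixel (u, v).

    ASSUMPTION: pinhole camera looking along +z in camera space, with
    cam_to_world_rot given as a 3x3 rotation (nested sequences).
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    return tuple(sum(cam_to_world_rot[i][j] * d_cam[j] for j in range(3))
                 for i in range(3))


def plucker_image(height, width, fx, fy, cx, cy, cam_to_world_rot, cam_center):
    """Build a height x width grid of 6-D Plücker rays (hypothetical helper).

    Rays pass through pixel centers, hence the +0.5 offset.
    """
    return [[plucker_ray(cam_center,
                         pixel_ray_direction(u + 0.5, v + 0.5, fx, fy, cx, cy,
                                             cam_to_world_rot))
             for u in range(width)]
            for v in range(height)]
```

Because each pixel carries its own ray, a map like this is spatially aligned with the image grid, which is presumably what makes it suitable as a pixel-aligned control, in contrast to a single coarse camera vector passed through an MLP.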

Paper Structure

This paper contains 21 sections, 11 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Pippo generates high-resolution, multi-view, studio-quality images from a single photo. In each sample, the left-most image is the input, followed by novel generated views of unseen subjects. First and second rows show generations from Full-body and Face-only photos captured in-the-wild using a mobile phone. Third row shows generation from a Head-only studio image. Last row illustrates Pippo's capability to faithfully blend observed and generated content, alongside the corresponding ground truth.
  • Figure 2: Pipeline overview. This is an illustration of how we train our model. (Left) we use data from a studio capture and train our multi-view diffusion model (right). We condition on a full reference photo and a cropped face, as well as the target view cameras and 2D projected spatial anchor indicating head position and orientation. Our diffusion model also takes in noisy target views and a timestep in order to predict the denoised views (top). In practice, we apply a segmentation mask around the person.
  • Figure 3: DiT and ControlMLP Block. Our DiT block (left) loosely follows [stable_diffusion], with an AdaIN-based timestep modulation. We apply attention and MLP blocks in parallel [pmlr-v202-dehghani23a], and jointly apply self-attention to the noisy generated tokens and the identity conditioning tokens. The ControlMLP block (right) is used to provide lightweight, spatially-aligned conditioning (Plücker rays and Spatial Anchor).
  • Figure 4: Entropy vs. Growth Factor ($\gamma$) for varying numbers of views (tokens) (\ref{ssec:scaling_views}). We present the attention entropy (Y-axis) obtained with our Attention Biasing technique, inspired by [jin2023training], for varying numbers of tokens (individual line plots) and across different values of the growth factor $\gamma$ introduced in Eq. \ref{attn_bias} (X-axis). On the X-axis, "No scaling" refers to the default attention formulation [vaswani2017attention], and $\gamma=1.0$ refers to the formulation of prior work [jin2023training]. Empirically, we find that a slightly higher value of $\gamma=1.4$ leads to the best visuals.
  • Figure 5: Generations under varying strengths of the growth factor $\gamma$ (\ref{ssec:scaling_views}). Each row shows the generated views for vanilla attention [vaswani2017attention] (No scaling), prior work [jin2023training], and our formulation (Eq. \ref{attn_bias}). A growth factor $\gamma$ greater than 1.0 is crucial to mitigate the entropy buildup. We show only 10 views per row, subsampled evenly from 60 views generated at $512 \times 512$ resolution. The model was trained to jointly denoise only 12 views ($N_i = 5 \times N_t$).
  • ...and 9 more figures
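
The captions for Figures 4 and 5 describe attention entropy growing as more views (tokens) are denoised jointly, and a growth factor $\gamma$ that counteracts it. The toy sketch below illustrates that effect with random queries and keys. The exact form of Eq. attn_bias is not shown in this excerpt, so the scale used here, $\gamma \sqrt{\log N / \log T}$ with $T$ training tokens and $N$ inference tokens, is an assumption chosen only so that $\gamma = 1.0$ matches the prior-work style of log-ratio scaling; all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def bias_scale(num_tokens, num_train_tokens, gamma=1.0):
    """Multiplier applied to the attention logits.

    ASSUMPTION: gamma * sqrt(log N / log T). This is a plausible stand-in,
    not the paper's confirmed equation. Both token counts must exceed 1.
    """
    return gamma * math.sqrt(math.log(num_tokens) / math.log(num_train_tokens))


def mean_attention_entropy(num_tokens, dim, scale, num_queries=16, seed=0):
    """Average entropy of softmax(scale * q.k / sqrt(dim)) over random queries.

    Queries and keys are i.i.d. Gaussian, a toy substitute for real
    activations. A fixed seed makes runs with the same num_tokens and dim
    use identical queries and keys, so only the scale differs.
    """
    rng = random.Random(seed)
    keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    total = 0.0
    for _ in range(num_queries):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        logits = [scale * sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        total += entropy(softmax(logits))
    return total / num_queries
```

With, say, 16 tokens per view (an arbitrary choice for the toy), comparing `mean_attention_entropy(12 * 16, 16, 1.0)` against `mean_attention_entropy(60 * 16, 16, 1.0)` shows the entropy increase from 12 to 60 views, and passing `bias_scale(60 * 16, 12 * 16, gamma)` as the scale shows it shrinking again as `gamma` grows. Sharpening the softmax by a factor above 1 always lowers entropy for fixed logits, which is consistent with the captions' observation that $\gamma > 1.0$ helps.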