Pippo: High-Resolution Multi-View Humans from a Single Image
Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov
TL;DR
Pippo generates dense, high-resolution multi-view imagery of a person from a single image, without requiring camera parameters for the input image or a fitted parametric body model. It introduces a three-stage training pipeline (pre-training, mid-training, and post-training) and a diffusion-transformer architecture with Spatial Anchor and Plücker-ray conditioning, plus an attention-biasing mechanism that scales the number of generated views at inference. A new 3D-consistency metric, RE@SG, is proposed to evaluate geometric fidelity without paired ground truth, and experiments report state-of-the-art results at 1K resolution on both studio and casual iPhone inputs. Potential applications include entertainment, fashion, and AR.
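To make the idea of a reprojection-based consistency score concrete, here is a minimal sketch in Python with NumPy. The summary above does not spell out how RE@SG is computed, so this assumes only that "RE" refers to a reprojection error over keypoints matched across generated views. The function names `triangulate_dlt` and `reprojection_error` are hypothetical, the keypoint matcher is omitted, and this is not the authors' implementation.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    projections: list of 3x4 camera projection matrices.
    points_2d:   list of (x, y) pixel observations, one per view.
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        # Each observation contributes two linear constraints on the point.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value.
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]

def reprojection_error(projections, points_2d):
    """Mean pixel distance between observations and the reprojected point.

    Assumption: a low value means the views agree on a single 3D point.
    """
    X = np.append(triangulate_dlt(projections, points_2d), 1.0)
    errors = []
    for P, xy in zip(projections, points_2d):
        proj = P @ X
        errors.append(np.linalg.norm(proj[:2] / proj[2] - np.asarray(xy)))
    return float(np.mean(errors))
```

A score of this kind needs only the known target cameras and matched keypoints in the generated views, which is why no ground-truth images are required.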
Abstract
We present Pippo, a generative model capable of producing 1K-resolution dense turnaround videos of a person from a single casually captured photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs (e.g., a fitted parametric model or camera parameters of the input image). We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio-captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high resolution and use pixel-aligned controls (e.g., Spatial Anchor and Plücker rays) to enable 3D-consistent generations. At inference, we propose an attention-biasing technique that allows Pippo to simultaneously generate more than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate the 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
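As an illustration of what a pixel-aligned Plücker-ray control can look like, the sketch below builds a per-pixel map of Plücker coordinates (d, o × d), where o is the camera center and d is the unit ray direction through each pixel. This is the standard definition of Plücker ray coordinates. The function name `plucker_ray_map`, the pinhole intrinsics `K`, and the camera-to-world pose convention are assumptions made for the example, and how Pippo feeds such maps into the network is not described here.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d) per pixel.

    K:            3x3 pinhole intrinsics (assumed convention).
    cam_to_world: 4x4 camera-to-world pose (assumed convention).
    """
    # Pixel centers in image coordinates, as homogeneous vectors.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pixels @ np.linalg.inv(K).T
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Moment vector m = o x d; it is the same for any point chosen on the ray.
    moment = np.cross(np.broadcast_to(origin, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)
```

Because the map has the same spatial layout as the image, it can be resized and concatenated with image features, which is what makes this kind of camera encoding pixel-aligned, in contrast to a single coarse camera embedding.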
