Generating Fit Check Videos with a Handheld Camera
Bowei Chen, Brian Curless, Ira Kemelmacher-Shlizerman, Steven M. Seitz
TL;DR
This work generates photorealistic fit-check videos from a handheld phone, using two mirror selfies and IMU motion data to synthesize a video of the subject performing a target motion in a chosen background. It introduces a diffusion-based video generation framework with a parameter-free frame generation strategy, multi-reference attention for fusing front and back appearances, and an image-based fine-tuning stage that sharpens frames and improves shadows and reflections. An IMU-driven motion and background retrieval pipeline pairs coherent motion with a compatible background. Experiments on a large fit-check dataset and on self-captured selfies show better realism, back-view accuracy, and lighting integration than the baselines, highlighting the method's practical potential for accessible, high-quality self-captured videos.
Abstract
Self-captured full-body videos are popular, but most deployments require mounted cameras, carefully framed shots, and repeated practice. We propose a more convenient solution that enables full-body video capture using handheld mobile devices. Our approach takes as input two static photos (front and back) of you in a mirror, along with an IMU motion reference that you perform while holding your mobile phone, and synthesizes a realistic video of you performing a similar target motion. The video can be rendered into a new scene, with consistent illumination and shadows. To achieve this, we propose a novel video diffusion-based model. Specifically, we introduce a parameter-free frame generation strategy and a multi-reference attention mechanism that effectively integrate appearance information from both the front and back selfies into the video diffusion model. Further, we introduce an image-based fine-tuning strategy that enhances frame sharpness and improves the generation of shadows and reflections for more realistic human-scene composition.
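To give a rough intuition for how attention can fuse two references, the following is a minimal sketch and not the paper's implementation. It assumes that tokens from the front and back selfies are concatenated into one key/value set, so each generated-frame token can draw appearance from whichever view matches it best. The function name `multi_reference_attention`, the token layout, and the omission of learned projections are all illustrative assumptions; the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_reference_attention(queries, front_tokens, back_tokens):
    """Hypothetical sketch: attend jointly over front and back reference tokens.

    queries, front_tokens, back_tokens: lists of equal-length float vectors.
    Returns (outputs, weights), where weights[i] holds one attention weight
    per reference token, front tokens first and back tokens after.
    Learned query/key/value projections are left out for brevity (assumption).
    """
    # Assumption: both references share one key/value set via concatenation.
    references = front_tokens + back_tokens
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)

    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot-product score against every reference token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in references]
        weights = softmax(scores)
        # Weighted sum of reference tokens, used here as the values.
        out = [sum(w * v[j] for w, v in zip(weights, references))
               for j in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy usage: one frame token that resembles the front-view token.
frame_tokens = [[4.0, 0.0]]
front = [[1.0, 0.0]]
back = [[0.0, 1.0]]
fused, attn = multi_reference_attention(frame_tokens, front, back)
```

In this toy case the frame token aligns with the front-view token, so most of the attention weight falls on the front reference. A frame showing the subject from behind would, under the same assumption, weight the back-selfie tokens more heavily.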
