Generating Fit Check Videos with a Handheld Camera

Bowei Chen, Brian Curless, Ira Kemelmacher-Shlizerman, Steven M. Seitz

TL;DR

This work enables photorealistic fit-check videos from handheld phones by using two mirror selfies and IMU motion data to synthesize a video of the subject performing a target motion against a chosen background. It introduces a diffusion-based video generation framework with a parameter-free frame generation strategy, multi-reference attention for fusing front and back appearances, and an image-based fine-tuning stage to sharpen frames and improve shadows and reflections. An IMU-driven retrieval pipeline selects a motion sequence and a compatible background. Experiments on a large fit-check dataset and self-captured selfies show superior realism, back-view accuracy, and lighting integration compared with baselines, highlighting practical potential for accessible, high-quality self-captured videos.

Abstract

Self-captured full-body videos are popular, but most deployments require mounted cameras, carefully framed shots, and repeated practice. We propose a more convenient solution that enables full-body video capture using handheld mobile devices. Our approach takes as input two static photos (front and back) of you in a mirror, along with an IMU motion reference that you perform while holding your mobile phone, and synthesizes a realistic video of you performing a similar target motion. We enable rendering into a new scene, with consistent illumination and shadows. We propose a novel video diffusion-based model to achieve this. Specifically, we propose a parameter-free frame generation strategy and a multi-reference attention mechanism to effectively integrate appearance information from both the front and back selfies into the video diffusion model. Further, we introduce an image-based fine-tuning strategy to enhance frame sharpness and improve the generation of shadows and reflections for more realistic human-scene composition.

Paper Structure

This paper contains 9 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: From two mirror selfie photos (top-left), we generate a photorealistic video of you performing a desired motion against a compatible desired background (bottom), with realistic shading, shadows, and reflections. The motion is captured using your mobile phone's IMU sensors (top-right). A target use case is self-captured "fit check" videos of you showing off an outfit.
  • Figure 2: Method Overview. Left: We train our model on a fit-check video dataset using pairs of front and back images, GT video frames, an inpainted background, and a pose sequence. Our frame generation strategy and multi-reference attention effectively encode features of multiple reference images. Middle: We fine-tune the trained model on a high-quality image dataset, supervising the generated frames using the input front or back image to enhance frame quality. Right: During inference, the method takes front and back selfies, a retrieved pose sequence, and a retrieved background as input, generating a video with the first two frames removed. The VAE decoders are omitted.
  • Figure 3: Model Ablations. The inputs are shown in (a). The naive method (b) fails to render accurate back views. ReferenceNet (c) improves back-view generation but introduces hood artifacts and blurs text. It also requires extra parameters, reducing model efficiency. Our frame generation strategy (d) produces better back views than (b) without extra parameters, though text remains blurry. Multi-reference attention (e) enhances back-view patterns, and adding the fine-tuning stage (f) delivers sharp, recognizable text.
  • Figure 4: Multi-Reference Attention. Given a pre-attention feature map (left), we duplicate and concatenate the front (blue) and back (green) features with all frame features (gray) (middle). For batch processing, the front and back features are also concatenated with themselves. The combined features then pass through self-attention layers, and we extract the first third of the output along the width axis as the final result (right).
  • Figure 5: Our Results. The left two columns show the input selfies and background, while the right six columns display the generated results (inset: pose input, locations adjusted to avoid occlusion). Given mirror selfies with various outfits and lighting conditions, our method generates realistic fit-check videos, accurately capturing appearance across diverse poses. Additionally, it generates reflections (rows 1–3) and shadows (row 4) on the ground, ensuring natural integration with both indoor and outdoor backgrounds.
  • ...and 3 more figures
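
The multi-reference attention described in the Figure 4 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. The function names, the single-head attention, the projection matrices `wq`, `wk`, `wv`, and the tensor layout `(T, H, W, C)` are all assumptions made for the example. It also assumes the front and back features occupy the first two entries of the frame batch, which is one reading of "concatenated with themselves" in the caption.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    # tokens: (T, N, C). Plain single-head scaled dot-product attention,
    # used here as a stand-in for the model's self-attention layers.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def multi_reference_attention(frames, front, back, wq, wk, wv):
    # frames: (T, H, W, C) pre-attention features for every frame.
    # front, back: (H, W, C) features of the two reference images.
    T, H, W, C = frames.shape
    # Duplicate the reference features once per frame.
    front_rep = np.broadcast_to(front, (T, H, W, C))
    back_rep = np.broadcast_to(back, (T, H, W, C))
    # Concatenate along the width axis: [frame | front | back].
    combined = np.concatenate([frames, front_rep, back_rep], axis=2)
    tokens = combined.reshape(T, H * 3 * W, C)
    out = self_attention(tokens, wq, wk, wv).reshape(T, H, 3 * W, C)
    # Keep the first third along the width axis as the result.
    return out[:, :, :W, :]

rng = np.random.default_rng(0)
T, H, W, C = 4, 2, 3, 8
frames = rng.normal(size=(T, H, W, C))
# Assumption: the front and back features are frames 0 and 1 of the batch.
front, back = frames[0], frames[1]
identity = np.eye(C)  # placeholder projections, not learned weights
result = multi_reference_attention(frames, front, back, identity, identity, identity)
print(result.shape)  # (4, 2, 3, 8)
```

Because the references enter only as extra tokens in an existing self-attention layer, a scheme like this adds no new weights. That is consistent with the caption's description, though the real model's attention details may differ.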