DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration
Zheyan Zhang, Diego Klabjan, Renee CB Manworren
TL;DR
DiffMVR tackles occlusion restoration in dynamic video by using two adaptive guidance images per frame (a symmetric current frame and a past fully visible frame) within a diffusion-based video inpainting framework. A dual cross-attention mechanism in the U-Net fuses structural and temporal cues, and a motion-consistency loss encourages smooth frame-to-frame transitions. Empirical evaluations on infant-motion datasets show state-of-the-art performance on both frame-level and video-level metrics, with segmented masks yielding the best results. The work has practical implications for real-time video restoration in healthcare and other dynamic settings, and the authors plan to open-source the code.
Abstract
In this work, we address a challenge in video inpainting: reconstructing occluded regions in dynamic, real-world scenarios. Motivated by the need for continuous human motion monitoring in healthcare settings, where facial features are frequently obscured, we propose a diffusion-based video-level inpainting model, DiffMVR. Our approach introduces a dynamic dual-guided image prompting system, leveraging adaptive reference frames to guide the inpainting process. This enables the model to capture both fine-grained details and smooth transitions between video frames, offering precise control over inpainting direction and significantly improving restoration accuracy in challenging, dynamic environments. DiffMVR represents a significant advancement in the field of diffusion-based inpainting, with practical implications for real-time applications in various dynamic settings.
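To make the dual-guidance idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each guidance image has already been encoded as a list of feature vectors, that the two cross-attention outputs are blended with a fixed weight, and that motion consistency is approximated by a mean squared difference between consecutive restored frames. The names cross_attention, dual_cross_attention, motion_consistency_loss, and alpha are hypothetical, and the learned query, key, and value projections of a real U-Net are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention of latent tokens over guidance tokens.

    Assumption: the context tokens act as both keys and values; a real
    model would apply learned projections first.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return attended


def dual_cross_attention(latent, symmetric_guide, past_guide, alpha=0.5):
    """Attend to each guidance image separately, then blend the results.

    Assumption: a fixed convex combination with weight alpha; the paper may
    fuse the two branches differently.
    """
    structural = cross_attention(latent, symmetric_guide)
    temporal = cross_attention(latent, past_guide)
    return [
        [alpha * s + (1.0 - alpha) * t for s, t in zip(row_s, row_t)]
        for row_s, row_t in zip(structural, temporal)
    ]


def motion_consistency_loss(previous_frame, current_frame):
    """Mean squared difference between two flattened restored frames.

    Assumption: a stand-in for the paper's loss, which is not specified in
    the abstract.
    """
    diffs = [(c - p) ** 2 for p, c in zip(previous_frame, current_frame)]
    return sum(diffs) / len(diffs)


# Usage: one latent token, two guidance sources with two-dimensional features.
latent_tokens = [[1.0, 0.0]]
symmetric_tokens = [[1.0, 0.0], [1.0, 0.0]]
past_tokens = [[0.0, 1.0]]
fused_tokens = dual_cross_attention(latent_tokens, symmetric_tokens, past_tokens)
```

Keeping the two attention branches separate before blending lets the structural cue (symmetric frame) and the temporal cue (past visible frame) be weighted independently, which is one plausible reading of the "precise control over inpainting direction" described above.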
