DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration

Zheyan Zhang, Diego Klabjan, Renee CB Manworren

TL;DR

DiffMVR tackles occlusion restoration in dynamic video by using two adaptive guidance images per frame (a symmetric current frame and a past fully visible frame) within a diffusion-based video inpainting framework. A dual cross-attention mechanism in the U-Net fuses structural and temporal cues, and a motion-consistency loss encourages smooth frame-to-frame transitions. Empirical evaluations on infant-motion datasets show state-of-the-art performance on both frame-level and video-level metrics, with segmented masks yielding the best results. The work has practical implications for real-time video restoration in healthcare and other dynamic settings, and the authors plan to open-source the code.

Abstract

In this work, we address a challenge in video inpainting: reconstructing occluded regions in dynamic, real-world scenarios. Motivated by the need for continuous human motion monitoring in healthcare settings, where facial features are frequently obscured, we propose a diffusion-based video-level inpainting model, DiffMVR. Our approach introduces a dynamic dual-guided image prompting system, leveraging adaptive reference frames to guide the inpainting process. This enables the model to capture both fine-grained details and smooth transitions between video frames, offering precise control over inpainting direction and significantly improving restoration accuracy in challenging, dynamic environments. DiffMVR represents a significant advancement in the field of diffusion-based inpainting, with practical implications for real-time applications in various dynamic settings.
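
As a rough illustration of the dual guidance idea summarized above, the sketch below shows latent tokens attending to two guidance embeddings separately, after which the two results are blended. It also includes a simple temporal penalty between consecutive frames. This is not the authors' implementation. The projection-free attention, the blending weight `alpha`, the squared-difference form of the motion term, and all function names are assumptions made here for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention; keys and values are both `context`.

    Assumption: no learned Q/K/V projections, to keep the sketch minimal.
    `queries` and `context` are lists of equal-width token vectors.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)])
    return out


def dual_cross_attention(latent, guide_symmetric, guide_past, alpha=0.5):
    """Attend to each guidance image separately, then blend the results.

    Assumption: a fixed convex combination with weight `alpha`; the paper
    may fuse the two branches differently.
    """
    a = cross_attention(latent, guide_symmetric)
    b = cross_attention(latent, guide_past)
    return [[alpha * x + (1.0 - alpha) * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def motion_consistency_loss(current, previous):
    """Mean squared difference between consecutive frame representations.

    Assumption: a plain temporal-smoothness penalty standing in for the
    paper's motion-consistency loss.
    """
    diffs = [(c - p) ** 2
             for rc, rp in zip(current, previous)
             for c, p in zip(rc, rp)]
    return sum(diffs) / len(diffs)


# Tiny usage example: one latent token, one token per guidance image.
latent = [[1.0, 0.0]]
fused = dual_cross_attention(latent, [[1.0, 0.0]], [[0.0, 1.0]])
loss = motion_consistency_loss(fused, latent)
```

With a single token per guidance image, each attention branch simply returns that token, so the fused output is the `alpha`-weighted average of the two guides. In a real model the two branches would sit inside the U-Net's attention layers and operate on learned projections.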

Paper Structure

This paper contains 18 sections, 11 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: DiffMVR Model Pipeline.
  • Figure 2: Occlusion removal and face restoration results on the HOF dataset using DiffMVR. The left side shows good inpainting results; the right side shows failure cases. Failures include unsuccessful occlusion removal, restored content that is inconsistent with the original object, and masked areas that do not blend seamlessly with the unchanged regions.
  • Figure 3: Qualitative comparison of DiffMVR with the benchmarked models on the Baby dataset, covering infants in pain, moving, and at rest. Row $1$ displays inputs taken from the source videos at the $5^{th}$ second, with segmented masks. The content is copyrighted and reprinted with permission. Rows $2$, $3$, and $4$ show inpainting results from, respectively: DiffMVR trained on segmented masks, with guide $1$ taken from the $4^{th}$ second of each video; Tuned-runwayml, using the text prompt "remove hands"; and Tuned-stabilityai, using the text prompt "remove hands".