V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data

Rotem Shalev-Arkushin, Aharon Azulay, Tavi Halperin, Eitan Richardson, Amit H. Bermano, Ohad Fried

TL;DR

V-LASIK addresses consistent glasses removal in videos by learning from imperfect synthetic data generated with an adjusted pretrained diffusion model. Paired data is created by inpainting the glasses region of each frame with cross-frame attention, and an image-to-image diffusion model is then finetuned on these pairs. At inference, the finetuned model is combined with a motion prior, blending with the original frames, and Inside-Out Normalization (ION) to ensure temporal coherence and identity preservation. The authors report significant improvement over existing methods on glasses removal and show that the approach generalizes to facial sticker removal, demonstrating that imperfect synthetic data and strong video priors can enable high-quality local video edits without real paired data.

Abstract

Diffusion-based generative models have recently shown remarkable image and video editing capabilities. However, local video editing, particularly removal of small attributes like glasses, remains a challenge. Existing methods either alter the videos excessively, generate unrealistic artifacts, or fail to perform the requested edit consistently throughout the video. In this work, we focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos. Due to the lack of paired data, we adopt a weakly supervised approach and generate synthetic imperfect data, using an adjusted pretrained diffusion model. We show that despite data imperfection, by learning from our generated data and leveraging the prior of pretrained diffusion models, our model is able to perform the desired edit consistently while preserving the original video content. Furthermore, we exemplify the generalization ability of our method to other local video editing tasks by applying it successfully to facial sticker-removal. Our approach demonstrates significant improvement over existing methods, showcasing the potential of leveraging synthetic data and strong video priors for local video editing tasks.

Paper Structure

This paper contains 21 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Glasses removal from a blinking eye by image editing methods. Left to right: LEDITS tsaban2023ledits, Instruct pix2pix brooks2023instructpix2pix, Lyu et al. lyu2022portrait, Stable Diffusion inpaint rombach2022high, ControlNet inpaint zhang2023adding, our synthetic dataset generation result, and our final result. Because image editing methods expect high-quality images of people looking straight at the camera, they struggle when these constraints are not met. In our dataset result ('Synth data'), eye artifacts appear as a result of the cross-frame attention. However, our model is still able to learn from the imperfect data and removes the glasses better than any out-of-the-box method, and better than the data it was trained on.
  • Figure 2: Cross-frame attention importance in data generation. Cross-frame attention helps remove glasses remnants, even when the mask is imperfect (left example), and reduces glasses reflections (right example).
  • Figure 3: Method overview. Step 1: we create an imperfect synthetic paired dataset by generating a glasses mask for each video frame and inpainting the masked region. We inpaint each frame using an adjusted version of ControlNet inpaint zhang2023adding: we replace the self-attention layers with cross-frame attention (cf attn) and blend the generated latent images with the noised masked original latent images at each diffusion step. The data generated in this step is imperfect; e.g., in the middle frame the person blinks, but the generated pair has open eyes. Nevertheless, thanks to the strong prior of the model, the data is good enough for finetuning an image-to-image diffusion model and achieving satisfactory results. Step 2: given our model trained to remove glasses from images, we combine it with a motion prior module to generate temporally consistent videos without glasses from previously unseen videos. To retain the original frame colors, at each diffusion step we blend the generated frames with the noised original masked latent images, and before decoding we apply Inside-Out Normalization (ION) to better align the statistics of the masked area with those of the area outside the mask.
  • Figure 4: Visual comparisons: We compare our results to different video editing and inpainting methods. Other methods often struggle with glasses removal, and even when they do remove the glasses, they tend to leave glasses remnants (e.g. RAVE right example), generate artifacts (e.g. FGT, ProPainter examples, TokenFlow left example), or fail to preserve the identity of the person (e.g. RAVE left example) or the eyelid position (e.g. RAVE both examples).
  • Figure 5: Visual comparison: We compare our results to different video editing and inpainting methods. Other methods often struggle with glasses removal, and even when they do remove the glasses, they tend to change the identity completely (e.g. TokenFlow right example), generate artifacts such as black areas around the eyes (e.g. FGT, ProPainter both examples), or fail to preserve the eyelid position (e.g. TokenFlow, RAVE right example).
  • ...and 3 more figures
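
The two mask-based operations named in the Figure 3 caption, per-step blending and Inside-Out Normalization, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the per-channel mean/standard-deviation matching used for ION, and the assumption that the glasses mask has already been resized to the latent resolution are all hypothetical choices made for illustration. The caption only states that ION aligns the statistics inside the mask with those outside it.

```python
import numpy as np


def blend_latents(generated, noised_original, mask):
    """Keep generated content inside the mask, noised original outside it.

    generated, noised_original: (C, H, W) latents at the same diffusion step.
    mask: (H, W) array, 1.0 on the glasses region and 0.0 elsewhere
    (assumed to be already resized to the latent resolution).
    """
    return mask * generated + (1.0 - mask) * noised_original


def inside_out_normalization(latent, mask, eps=1e-6):
    """Assumed form of ION: per channel, standardize the values inside the
    mask, then rescale them to the mean and std of the values outside it.
    """
    inside = mask.astype(bool)
    out = latent.copy()
    for c in range(latent.shape[0]):
        in_vals = latent[c][inside]
        out_vals = latent[c][~inside]
        if in_vals.size == 0 or out_vals.size == 0:
            continue  # nothing to align against
        standardized = (in_vals - in_vals.mean()) / (in_vals.std() + eps)
        out[c][inside] = standardized * out_vals.std() + out_vals.mean()
    return out


# Toy usage: a 4-channel 8x8 latent whose masked region is shifted and scaled.
rng = np.random.default_rng(0)
mask = np.zeros((8, 8))
mask[2:5, 2:6] = 1.0
latent = rng.normal(size=(4, 8, 8))
latent[:, 2:5, 2:6] = latent[:, 2:5, 2:6] * 3.0 + 5.0
aligned = inside_out_normalization(latent, mask)
```

After the call, the masked region of each channel has approximately the same mean and standard deviation as the unmasked region, while values outside the mask are left untouched.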