Towards Online Real-Time Memory-based Video Inpainting Transformers

Guillaume Thiry, Hao Tang, Radu Timofte, Luc Van Gool

TL;DR

This work addresses online real-time video inpainting by adapting transformer-based models to memory-efficient, real-time inference. It introduces three progressively refined variants: Online baseline, Memory-based, and Refined memory-based. The memory-based variants reuse past computations, and the refined variant adds communication between two models, reaching real-time throughput above $20$ FPS while aiming to preserve reconstruction quality. Across three backbones (DSTT, FuseFormer, E2FGVI) and two datasets (DAVIS, YouTube-VOS), the memory-based approaches substantially boost FPS (often $\sim$2–3×) with some quality trade-offs, while the refined model mitigates much of this loss by re-inpainting past frames and exchanging information between parallel inpainters. The results demonstrate a practical path toward live, high-quality video inpainting, with code and pretrained models to be released on acceptance, enabling broader application in live broadcasting and augmented perception.

Abstract

Video inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they are still unsuitable for live videos, one of the last steps to make them completely convincing and usable. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and show an insufficient frame rate. In our approach, we propose a framework to adapt existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining a decent inpainting quality. Using this framework with some of the most recent inpainting models, we show great online results with a consistent throughput above 20 frames per second. The code and pretrained models will be made available upon acceptance.

Paper Structure (13 sections, 4 equations, 6 figures, 4 tables)

This paper contains 13 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Original inpainting model and its natural online adaptation. (a) The model inpaints a window centered on frame $f = 20$ with radius $k = 5$. Reference frames sampled at a rate $r = 10$ are added to the model's input. (b) In online inpainting, only past frames are visible. To inpaint frame $18$, we use a window and sampled frames from the past. The whole window is still predicted, but only the last frame is actually used.
  • Figure 2: Transformers in the baseline and memory-based models. (a) Without memory, the baseline model processes all the frames in each transformer, which makes the computation quadratic in the number of frames. (b) When the memory of the previous inpaintings is kept, only the new frame (18) needs to be computed, while the transformers can still use the other frames (0 to 17) as context. After each transformer, the new result is saved for later. Each frame is saved in the memory as many times as there are transformers, each saved value being different. Following Equation 3, we have here $f = 18$, $s = 5$, and $r = 10$.
  • Figure 3: Memory-based and refined models. (a) Thanks to the inpainting memory of the last seen frames, the new frame is the only one to be computed, but the inpainting still benefits from this previous context. Following Equation 3, we have here $f = 18$, $s = 5$, and $r = 10$. (b) In this model, the online inpainter still uses the memory of the last frames it inpainted, but it also receives information from the inpaintings of a second model. This refining inpainter performs a slower but better inpainting, as it can work directly on windows. If tuned correctly, both models process the video at the same speed, so that the refined memory is always relevant to the online inpainter. Following Equation 4, we have here $f = 18$, $t = 14$ (for this example), $s = s' = 3$, and $r' = 10$ (tunable parameters).
  • Figure 4: PSNR/FPS operating points on each backbone, using different input sizes.
  • Figure 5: Mean PSNR and SSIM at each frame on YouTube-VOS (500+ videos). Models use the FuseFormer backbone; values are smoothed with a moving average over 10 frames. On the right, we show the differences from the offline model. The two best-performing models partly close the quality gap with the offline model as they see more frames to use.
  • ...and 1 more figures
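
The frame selection described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `offline_inputs` and `online_inputs` are hypothetical, and two details are assumptions not stated in the captions, namely that reference frames already inside the window are skipped, and that the online window spans the $k$ frames before the current one.

```python
def offline_inputs(f, k, r, num_frames):
    """Frames fed to an offline model (Figure 1a), under the assumptions above.

    f: center frame, k: window radius, r: reference sampling rate.
    """
    # Local window centered on f, clipped to the video bounds.
    window = [i for i in range(f - k, f + k + 1) if 0 <= i < num_frames]
    # Reference frames sampled every r frames over the whole video
    # (assumption: frames already in the window are not repeated).
    refs = [i for i in range(0, num_frames, r) if i not in window]
    return window, refs


def online_inputs(f, k, r):
    """Frames fed to an online model (Figure 1b): nothing after frame f.

    Assumption: the window covers the k frames preceding f, plus f itself.
    """
    window = list(range(max(0, f - k), f + 1))
    # Reference frames can only be sampled from the past.
    refs = [i for i in range(0, f + 1, r) if i not in window]
    return window, refs


if __name__ == "__main__":
    # Values from the Figure 1 caption; the 50-frame video length is made up.
    print(offline_inputs(20, 5, 10, 50))  # window 15..25, refs [0, 10, 30, 40]
    print(online_inputs(18, 5, 10))       # window 13..18, refs [0, 10]
```

In the online case the model still predicts every frame of the window, but only the output for frame `f` would be kept, which is the redundancy the memory-based variants aim to avoid.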