Towards Online Real-Time Memory-based Video Inpainting Transformers
Guillaume Thiry, Hao Tang, Radu Timofte, Luc Van Gool
TL;DR
This work addresses online, real-time video inpainting by adapting transformer-based models to memory-efficient inference at real-time speed. It introduces three progressively refined variants (an Online baseline, a Memory-based model, and a Refined memory-based model) that leverage past computations and cross-model communication to reach a throughput above $20$ FPS while aiming to preserve reconstruction quality. Across three backbones (DSTT, FuseFormer, E2FGVI) and two datasets (DAVIS, YouTube-VOS), the memory-based approaches substantially boost FPS (often $\sim$2–3×) at some cost in quality, and the refined model recovers much of this loss by re-inpainting past frames and exchanging information between parallel inpainters. The results point to a practical path toward live, high-quality video inpainting, with potential applications in live broadcasting and augmented perception. Code and pretrained models are to be released upon acceptance.
Abstract
Video inpainting has improved significantly in recent years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they remain unsuitable for live video, and handling live video is one of the last steps needed to make them fully convincing and usable. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and run at an insufficient frame rate. We propose a framework that adapts existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining decent inpainting quality. Applying this framework to some of the most recent inpainting models, we obtain strong online results with a consistent throughput above 20 frames per second. The code and pretrained models will be made available upon acceptance.
