Context-Aware Input Orchestration for Video Inpainting
Hoyoung Kim, Azimbek Khudoyberdiev, Seonghwan Jeong, Jihoon Ryoo
TL;DR
This work addresses memory constraints in mobile video inpainting by proposing AdaptIn, a context-aware pipeline that dynamically selects input frames according to visual dynamics inferred from mask changes and optical-flow cues. By analyzing how input composition interacts with dynamic content, the authors show that adding neighboring frames benefits fast-changing scenes, whereas static scenes benefit more from reference frames. This enables a memory-quality tradeoff that preserves inpainting quality on edge devices. The approach is validated on both flow-guided and transformer-based inpainting models, showing improved temporal coherence and perceptual quality on dynamic content, with practical implications for on-device video restoration and editing. Overall, AdaptIn offers a principled, context-aware strategy for balancing memory usage and inpainting quality in real-time applications on mobile and AR devices.
Abstract
Traditional neural-network-based inpainting methods struggle to deliver high-quality results within the processing and memory constraints of mobile devices. We introduce an approach that reduces memory usage by altering the composition of the input data. Video inpainting typically relies on a predetermined set of input frames, consisting of neighboring and reference frames and often limited to five frames in total. We examine how varying the proportion of these two frame types affects the quality of the inpainted video. By dynamically adjusting the input frame composition based on optical flow and mask changes, we observe improved results across a variety of content, including scenes with rapid visual context changes.
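The central idea is to trade neighboring frames for reference frames within a fixed input budget, depending on how dynamic the scene is. The Python sketch below illustrates that idea under stated assumptions and is not AdaptIn itself. Several parts are hypothetical choices made only for illustration: the dynamics score (a weighted blend of normalized mean optical-flow magnitude and mask change, measured as 1 - IoU of consecutive masks), the names `mask_change`, `dynamics_score`, and `split_budget`, the parameters `flow_scale` and `w_flow`, and the minimum frame counts. The mean flow magnitude is assumed to be precomputed by an external optical-flow estimator.

```python
from typing import FrozenSet, Tuple

Pixel = Tuple[int, int]  # (row, col) coordinate of a masked pixel


def mask_change(prev_mask: FrozenSet[Pixel], cur_mask: FrozenSet[Pixel]) -> float:
    """Return 1 - IoU of two consecutive masks (0 = unchanged, 1 = disjoint)."""
    union = prev_mask | cur_mask
    if not union:  # both masks empty: treat as no change
        return 0.0
    return 1.0 - len(prev_mask & cur_mask) / len(union)


def dynamics_score(mean_flow: float, change: float,
                   flow_scale: float = 10.0, w_flow: float = 0.5) -> float:
    """Blend normalized optical-flow magnitude and mask change into [0, 1].

    Hypothetical knobs: flow_scale is the mean flow (pixels/frame) treated as
    fully dynamic, and w_flow weights flow against mask change.
    """
    flow_term = min(max(mean_flow, 0.0) / flow_scale, 1.0)
    return w_flow * flow_term + (1.0 - w_flow) * change


def split_budget(score: float, budget: int = 5,
                 min_neighbors: int = 1, min_refs: int = 1) -> Tuple[int, int]:
    """Split a fixed frame budget into (neighboring, reference) frame counts.

    A high dynamics score shifts frames toward neighbors; a low score shifts
    them toward reference frames. The total always equals `budget`.
    """
    flexible = budget - min_neighbors - min_refs
    if flexible < 0:
        raise ValueError("budget is smaller than the required minimum frames")
    score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
    n_neighbors = min_neighbors + round(score * flexible)
    return n_neighbors, budget - n_neighbors


if __name__ == "__main__":
    # Nearly static clip: tiny flow, identical masks.
    static = dynamics_score(mean_flow=0.2, change=0.0)
    # Fast-changing clip: large flow, mask moved to a disjoint region.
    moving = dynamics_score(mean_flow=25.0,
                            change=mask_change(frozenset({(0, 0)}),
                                               frozenset({(5, 5)})))
    print(split_budget(static))  # (1, 4)
    print(split_budget(moving))  # (4, 1)
```

With the default five-frame budget, the nearly static clip receives one neighboring frame and four reference frames. The fast-changing clip receives four neighboring frames and one reference frame. The total stays constant, so the memory footprint of the input does not change. Only its composition shifts, which matches the tradeoff the abstract describes.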
