Context-Aware Input Orchestration for Video Inpainting

Hoyoung Kim, Azimbek Khudoyberdiev, Seonghwan Jeong, Jihoon Ryoo

TL;DR

This work addresses memory constraints in mobile video inpainting by proposing AdaptIn, a context-aware pipeline that dynamically selects input frames based on visual dynamics inferred from mask changes and optical-flow cues. By analyzing how input composition interacts with dynamic content, the authors show that fast-changing scenes benefit from more neighboring frames while static scenes benefit from more reference frames, which enables a memory-quality tradeoff that preserves inpainting quality on edge devices. The approach is validated on flow-guided and transformer-based inpainters, demonstrating improved temporal coherence and perceptual quality on dynamic content, with practical implications for on-device video restoration and editing. Overall, AdaptIn provides a context-aware strategy for balancing memory usage and inpainting quality in real-time applications on mobile and AR devices.

Abstract

Traditional neural network-driven inpainting methods struggle to deliver high-quality results within the processing power and memory constraints of mobile devices. Our research introduces an approach that optimizes memory usage by altering the composition of the input data. Typically, video inpainting relies on a predetermined set of input frames, consisting of neighboring and reference frames and often limited to five frames. Our focus is to examine how varying the proportion of these input frames affects the quality of the inpainted video. By dynamically adjusting the input frame composition based on optical flow and changes in the mask, we observed improved quality across a range of content, including videos with rapid changes in visual context.
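The adjustment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal weighting of the two cues, and the normalisation constants `flow_scale` and `mask_scale` are all hypothetical, chosen only to show how a fixed five-frame budget might be split between neighboring and reference frames as the scene becomes more dynamic.

```python
import numpy as np

def mask_change(prev_mask, curr_mask):
    """Fraction of pixels whose binary mask value differs between two frames."""
    prev = np.asarray(prev_mask, dtype=bool)
    curr = np.asarray(curr_mask, dtype=bool)
    return float(np.mean(prev != curr))

def mean_flow_magnitude(flow):
    """Mean per-pixel magnitude of a dense flow field of shape (H, W, 2)."""
    flow = np.asarray(flow, dtype=np.float64)
    return float(np.mean(np.linalg.norm(flow, axis=-1)))

def choose_composition(flow_mag, mask_delta, budget=5,
                       flow_scale=4.0, mask_scale=0.05):
    """Split a fixed frame budget into (neighboring, reference) counts.

    flow_scale and mask_scale are hypothetical normalisers, not values
    taken from the paper. A higher dynamics score shifts the budget
    toward neighboring frames; a lower one shifts it toward reference frames.
    """
    dynamics = 0.5 * flow_mag / flow_scale + 0.5 * mask_delta / mask_scale
    dynamics = min(1.0, dynamics)
    # Always keep at least one neighboring frame: the target frame itself.
    n_neighbor = 1 + round(dynamics * (budget - 1))
    return n_neighbor, budget - n_neighbor
```

With these assumed constants, a static scene (zero flow, unchanged mask) keeps only the target frame as a neighbor and spends the rest of the budget on reference frames, while a fast scene spends the whole budget on neighboring frames.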

Paper Structure

This paper contains 30 sections, 1 equation, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Conceptual Description of Input Configuration for Video Inpainting. When inpainting streamed frames, we adapt the input composition to the dynamics of the visual context so that memory is used efficiently while quality is preserved.
  • Figure 2: Example of reference frames and neighboring frames as input for video inpainting. Neighboring frames refer to the input frames that are either the target frame to be inpainted or frames adjacent to the target frame. Reference frames are those frames that are temporally distant from the target frames.
  • Figure 3: PSNR over the Ratio of Reference Frames in Input Frames across Different Videos. The x-axis represents the ratio of reference frames within the input frames. A higher x value means a higher proportion of reference frames, whereas a lower value means a higher proportion of neighboring frames. The PSNR patterns vary according to this ratio for each video. For rhino.mp4 and bear.mp4, the PSNR increased as the proportion of reference frames increased, which suggests that the visual context of these two videos changes slowly. Conversely, for walking.mp4, the quality improved as the proportion of neighboring frames increased. This suggests that the visual context changes quickly and is therefore more influenced by the surrounding frames.
  • Figure 4: The Maximum Change Rates in PSNR across Optical Flow (a) and Change of Mask (b). The model used here is ProPainter. The y-axis represents the maximum change rate of PSNR according to the input frame composition, as shown in Figure 3. A positive value indicates that the PSNR is at its maximum when the proportion of reference frames is high, while a negative value indicates that the PSNR is at its maximum when the proportion of neighboring frames is high. As the optical flow and mask variation increase, the values shift in the negative direction. This indicates that neighboring frames become more influential as the visual context changes faster.
  • Figure 5: Comparison of PSNR Change Rates and their Distributions between the Non-flow-based Inpainter, STTN, and the Flow-guided Inpainter, ProPainter. A steeper slope indicates a stronger influence of optical flow.
  • ...and 6 more figures
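
The captions for Figures 2 and 3 describe inputs built from neighboring frames (the target and its adjacent frames) and temporally distant reference frames, with quality measured by PSNR as the reference ratio varies. The sketch below illustrates that setup under stated assumptions. The helper `select_frames`, its `ref_stride` parameter, and the farthest-first choice of reference frames are hypothetical and are not taken from the paper. The `psnr` function uses the standard definition, 10·log10(MAX² / MSE).

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def select_frames(target, num_frames, budget=5, ref_ratio=0.4, ref_stride=10):
    """Pick neighboring and reference frame indices for one target frame.

    Assumes num_frames >= budget. ref_stride and the farthest-first
    ordering are illustrative choices, not the paper's sampling rule.
    """
    n_ref = int(round(budget * ref_ratio))
    n_ref = min(n_ref, budget - 1)  # the target frame is always kept
    n_neighbor = budget - n_ref

    # Neighboring frames: a window around the target, clipped to the video.
    half = n_neighbor // 2
    start = max(0, min(target - half, num_frames - n_neighbor))
    neighbors = list(range(start, start + n_neighbor))

    # Reference frames: strided candidates outside the window,
    # preferring those temporally farthest from the target.
    candidates = [i for i in range(0, num_frames, ref_stride)
                  if i not in neighbors]
    candidates.sort(key=lambda i: -abs(i - target))
    references = sorted(candidates[:n_ref])
    return neighbors, references
```

Sweeping `ref_ratio` from 0 to 1 for a fixed `budget`, running an inpainter on each selection, and recording `psnr` against the ground-truth frame would give a curve of the kind shown in Figure 3.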