VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation

Chenyang Wu, Jiayi Fu, Chun-Le Guo, Shuhao Han, Chongyi Li

TL;DR

High-resolution video frame interpolation is hindered by large motion and by edge artifacts that arise when flows are estimated at low resolution and then upsampled. VTinker introduces two components. Guided Flow Upsampling (GFU) refines the upsampled flows with guidance from the high-resolution frames. A Texture Mapping pipeline constructs an intermediate proxy and replaces misaligned pixel-wise fusion with texture blocks sourced from the inputs, yielding sharper edges and better texture continuity. The method achieves state-of-the-art performance on 4K and other high-resolution benchmarks, and extensive ablations support the effectiveness of GFU, texture-based synthesis, and perceptual-loss-based supervision. In practice, the approach reduces ghosting and mosaic artifacts in high-resolution video synthesis while preserving fine details.

Abstract

Estimating motion between high-resolution frames is challenging because of large pixel displacements and high computational cost. Most flow-based Video Frame Interpolation (VFI) methods therefore predict bidirectional flows at low resolution and then apply high-magnification upsampling (e.g., bilinear) to obtain the high-resolution flows. However, this upsampling strategy may cause blur or mosaic artifacts at flow edges. Additionally, motion estimation at low resolution cannot adequately capture the motion of fine details at high resolution, which leads to misaligned task-oriented flows. When input frames are warped with such inaccurate flows and combined pixel by pixel, the interpolated frame exhibits ghosting and discontinuities. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: Guided Flow Upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU uses the input frames as guidance to reduce the blurring introduced by bilinear flow upsampling, which makes flow edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy so that a reconstruction module can produce the final interpolated frame. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Code is available at: https://github.com/Wucy0519/VTinker.
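
To make the baseline concrete, the following is a minimal pure-Python sketch of the bilinear flow upsampling that the abstract describes as the common practice. It is not taken from the VTinker code base: the function name `upsample_flow_bilinear`, the list-of-tuples flow layout, and the pixel-centre sampling convention are all assumptions made for illustration. The displacement values are multiplied by the scale factor, since a 1-pixel motion at low resolution corresponds to a `scale`-pixel motion at high resolution.

```python
def upsample_flow_bilinear(flow, scale):
    """Bilinearly upsample a flow field stored as rows of (u, v) tuples.

    Hypothetical helper for illustration only. Displacements are multiplied
    by `scale` so that they are expressed in high-resolution pixels.
    """
    h, w = len(flow), len(flow[0])
    out = []
    for y in range(h * scale):
        # Map the high-res pixel centre back to low-res coordinates (assumed convention).
        sy = min(max((y + 0.5) / scale - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for x in range(w * scale):
            sx = min(max((x + 0.5) / scale - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            vec = []
            for c in range(2):  # c = 0 is horizontal (u), c = 1 is vertical (v)
                top = (1 - wx) * flow[y0][x0][c] + wx * flow[y0][x1][c]
                bot = (1 - wx) * flow[y1][x0][c] + wx * flow[y1][x1][c]
                vec.append(scale * ((1 - wy) * top + wy * bot))
            row.append(tuple(vec))
        out.append(row)
    return out


# A 1x2 low-res flow with a motion boundary: the left sample is static,
# the right sample moves 2 low-res pixels horizontally.
low = [[(0.0, 0.0), (2.0, 0.0)]]
high = upsample_flow_bilinear(low, 4)
u_row = [round(u, 2) for u, _ in high[0]]
print(u_row)  # [0.0, 0.0, 1.0, 3.0, 5.0, 7.0, 8.0, 8.0]
```

In this toy case the two motions (0 and 8 high-resolution pixels) are joined by a ramp of intermediate values (1, 3, 5, 7) that belong to neither object, which is the kind of blurred flow edge the abstract refers to.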

Paper Structure

This paper contains 20 sections, 18 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Comparison of VTinker with a current flow-based method for high-resolution VFI. After Motion Estimation (M.E.), bidirectional flows are obtained at low resolution. (a) The current method employs bilinear upsampling to obtain the high-resolution flows, followed by pixel-by-pixel synthesis. (b) VTinker employs the proposed Guided Flow Upsampling (GFU) and generates the final result through texture mapping.
  • Figure 2: Architecture overview of the proposed VTinker. Given two consecutive frames ${I_0}$ & ${I_1}$, VTinker first estimates the bi-directional flows ${F_{0 \to 1}}$ & ${F_{1 \to 0}}$, which are fused by warping to produce an intermediate proxy. Then, after extracting features of the input frames, VTinker divides the features into texture blocks. Through Flow-guided Block-Texture Searching and Local Matching for Block-Texture, texture blocks corresponding to each position of the proxy are selected. Finally, VTinker maps these block-textures to the proxy and rebuilds the interpolated frame by a reconstruction module.
  • Figure 3: Guided Flow Upsampling (GFU) Module. The difference between bilinear upsampling and GFU is shown visually above the line. The edges of the flow upsampled by GFU are better aligned with the input frame than those of the bilinearly upsampled flow. Blue and orange points indicate pixels at low resolution. The framework of GFU is shown below the line. GFU uses the input frame as guidance information to refine the flow upsampling. A clear contrast is shown in Tab. \ref{tab:xrsy} and Fig. \ref{fig:xrsy}.
  • Figure 4: Qualitative comparisons among different methods at 2K resolution. All cases contain motion of more than 140 pixels. Overlay is the average of the two input frames.
  • Figure 5: Qualitative comparisons among different methods at 4K resolution. All cases contain motion of more than 200 pixels. Zoom in for details.
  • ...and 7 more figures
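
The contrast drawn in the Figure 3 caption, where flow edges follow the input frame instead of being smeared by bilinear interpolation, can be illustrated with a hand-crafted stand-in. The sketch below is not the GFU module, which is a learned network. It swaps in a 1-D joint-bilateral-style upsampling, in which each low-resolution flow sample is weighted both by its bilinear spatial weight and by how similar the guide intensity at the target pixel is to the guide intensity at that sample. The function name `guided_upsample_1d`, the parameter `sigma_r`, and the toy signals are assumptions for illustration.

```python
import math


def guided_upsample_1d(low_flow, guide, scale, sigma_r=0.1):
    """Upsample a 1-D flow using a high-res guide signal (joint-bilateral style).

    Hypothetical illustration, not the learned GFU module.
    `guide` must have len(low_flow) * scale samples.
    """
    n = len(low_flow)
    out = []
    for x in range(n * scale):
        # Same pixel-centre mapping as plain bilinear interpolation (assumed convention).
        sx = min(max((x + 0.5) / scale - 0.5, 0.0), n - 1.0)
        x0 = int(sx)
        x1 = min(x0 + 1, n - 1)
        frac = sx - x0
        num = 0.0
        den = 0.0
        for q, w_spatial in ((x0, 1.0 - frac), (x1, frac)):
            # Guide intensity near the high-res centre of low-res sample q.
            gq = guide[q * scale + scale // 2]
            # Range weight: small when the guide differs across an edge.
            w_range = math.exp(-((guide[x] - gq) ** 2) / (2 * sigma_r ** 2))
            num += w_spatial * w_range * low_flow[q]
            den += w_spatial * w_range
        # Rescale displacements to high-resolution pixels.
        out.append(scale * num / den)
    return out


# Low-res flow: left sample static, right sample moves 2 low-res pixels.
low_flow = [0.0, 2.0]
# High-res guide intensity with an object edge between pixels 4 and 5.
guide = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
refined = [round(v, 2) for v in guided_upsample_1d(low_flow, guide, 4)]
print(refined)  # [0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 8.0, 8.0]
```

In this toy case the flow discontinuity lands exactly on the guide's edge. Plain bilinear interpolation of the same two samples would instead spread intermediate values across several pixels between them.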