ANVIL: Accelerator-Native Video Interpolation via Codec Motion Vector Priors

Shibo Liu

Abstract

Real-time 30-to-60 fps video frame interpolation on mobile neural processing units (NPUs) requires synthesizing each intermediate frame within 33.3 ms. We show that mainstream flow-based video frame interpolation faces three structural deployment barriers on mobile NPUs: spatial sampling operators exceed the frame budget or lack hardware support, iterative flow refinement collapses under 8-bit integer post-training quantization, and memory-bound operators dominate the inference graph. ANVIL addresses these barriers by reusing motion vectors from the H.264/AVC decoder to prealign input frames, removing learned optical flow, spatial sampling, and iterative accumulation from the accelerator graph. The remaining residual is refined by a convolution-dominated network composed almost entirely of compute-bound operators. On a Snapdragon 8 Gen 3 device, ANVIL achieves 12.8 ms 1080p inference at 8-bit integer precision; an open-source Android player sustains 28.4 ms median end-to-end latency over 30-minute continuous playback. Per-operator causal analysis identifies quantized accumulation on recurrent flow states as a key mechanism behind integer quantization failure in iterative methods. The current design targets H.264/AVC playback with decoder-exposed motion vectors.

Paper Structure

This paper contains 34 sections, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: ANVIL three-processor pipeline. CPU densifies and downsamples MVs (${\sim}$2.9 ms); GPU (Vulkan) performs median filtering, Gaussian blur, and sub-pixel remap (${\sim}$3.7 ms); HTP runs the INT8 residual network (${\sim}$13--17 ms). CPU/GPU preparation for frame $N{+}1$ is pipelined with HTP inference for frame $N$.
  • Figure 2: UNet-v3b architecture. Input: 6-channel prealigned pair. Encoder: 4 levels with channel widths [16, 32, 64, 64] (ANVIL-S) or [16, 32, 96, 96] (ANVIL-M); encoder blocks per level $(1,1,1,2)$ (ANVIL-S) or $(1,1,2,2)$ (ANVIL-M). Bottleneck: 4 ResBlocks (ANVIL-S) or 8 (ANVIL-M) at $\frac{H}{16} \times \frac{W}{16}$. Decoder uses Add skip connections with 1 ResBlock per level. Output: 3-channel residual added to the prealigned blend. BN is folded into Conv at deploy time.
  • Figure 3: Visual comparison on Xiph 1080p ($3\times$ magnified insets). (a) old_town_cross: ANVIL smoothing suppresses noise, RIFE preserves detail. (b) tractor: ANVIL over-smooths edges, RIFE produces ghosting. (c) riverbed: both fail on stochastic texture.
  • Figure 4: Quality scaling with model capacity. Basic prealignment (dashed) and smoothed prealignment (solid) series both scale without saturation. Dashed horizontal lines show RIFE HDv3 (3.04M) and NAFNet ceiling (17.1M, smoothed prealignment retrained) as reference points.
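The prealignment stage summarized in Figure 1 can be sketched in a few lines. The sketch below is a minimal illustration, not ANVIL's implementation: it assumes grayscale frames as nested lists, whole-pixel motion vectors on fixed 4x4 blocks, and a midpoint warp of half the MV in each direction; the actual pipeline densifies and smooths quarter-pel H.264 MVs on CPU/GPU before the sub-pixel remap, and the MV sign convention depends on the decoder's reference direction. The function names (`prealign`, `shift_block`) are hypothetical.

```python
BLOCK = 4  # illustrative block size; H.264 partitions range from 4x4 to 16x16

def shift_block(frame, y0, x0, dy, dx, out):
    """Copy one BLOCKxBLOCK block of `frame`, displaced by (dy, dx), into `out`."""
    h, w = len(frame), len(frame[0])
    for y in range(y0, y0 + BLOCK):
        for x in range(x0, x0 + BLOCK):
            sy = min(max(y + dy, 0), h - 1)  # clamp reads at frame borders
            sx = min(max(x + dx, 0), w - 1)
            out[y][x] = frame[sy][sx]

def prealign(f0, f1, mvs):
    """Warp f0 forward and f1 backward by half the block MV, then blend.

    `mvs[by][bx]` is the (dy, dx) motion vector for the block at (by, bx),
    given here in whole pixels for simplicity (the real pipeline resolves
    quarter-pel MVs via a GPU sub-pixel remap). Returns the prealigned
    blend that the INT8 residual network then refines.
    """
    h, w = len(f0), len(f0[0])
    w0 = [[0] * w for _ in range(h)]
    w1 = [[0] * w for _ in range(h)]
    for by, row in enumerate(mvs):
        for bx, (dy, dx) in enumerate(row):
            y0, x0 = by * BLOCK, bx * BLOCK
            # Move each source frame halfway toward the midpoint time t = 0.5.
            shift_block(f0, y0, x0, dy // 2, dx // 2, w0)
            shift_block(f1, y0, x0, -(dy // 2), -(dx // 2), w1)
    return [[(w0[y][x] + w1[y][x]) // 2 for x in range(w)] for y in range(h)]
```

Because the warp is resolved before the accelerator graph runs, the residual network sees two already-aligned inputs (the 6-channel pair in Figure 2) and needs no spatial sampling operators of its own.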