Video Frame Interpolation for Polarization via Swin-Transformer

Feng Huang; Xin Zhang; Yixuan Xu; Xuesong Wang; Xianyu Wu

TL;DR

The paper addresses the challenge of interpolating polarized video frames, where polarization signals vary with viewpoint and traditional VFI methods struggle to preserve polarization cues. It introduces Swin-VFI, a multi-stage, multi-scale Video Swin Transformer that uses local shifted-cube self-attention to capture long-range spatiotemporal dependencies at reduced computational cost. A polarization-aware loss, combining intensity and polarization terms, guides the network to recover the angle of linear polarization (AoLP) and the degree of linear polarization (DoLP) accurately. Evaluations on the polarization datasets PVFI-Mono and PHSPD, as well as on conventional VFI benchmarks, show that Swin-VFI achieves superior reconstruction accuracy on both intensity and polarization metrics while significantly reducing parameters and FLOPs, and that its interpolated frames support Shape from Polarization (SfP) and human shape reconstruction tasks. Future work will extend to color-polarized video interpolation and broader polarization modalities.

Abstract

Video Frame Interpolation (VFI) has been extensively explored and demonstrated, yet its application to polarization remains largely unexplored. Because polarizing filters transmit light selectively, longer exposure times are typically required to ensure sufficient light intensity, which in turn lowers the temporal sampling rate. Furthermore, because the polarization reflected by objects varies with the shooting perspective, estimating pixel displacement alone is insufficient to accurately reconstruct the intermediate polarization. To tackle these challenges, this study proposes a multi-stage, multi-scale network called Swin-VFI, based on the Swin-Transformer, and introduces a tailored loss function that helps the network learn polarization changes. To assess the practicality of the proposed method, this study evaluates its interpolated frames in Shape from Polarization (SfP) and Human Shape Reconstruction tasks, comparing them with other state-of-the-art methods such as CAIN, FLAVR, and VFIT. Experimental results demonstrate our approach's superior reconstruction accuracy across all tasks.
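For readers unfamiliar with the polarization quantities the abstract refers to, the following is a minimal sketch of how AoLP and DoLP are conventionally derived from the four polarizer orientations (0°, 45°, 90°, 135°) of a division-of-focal-plane sensor, using the standard linear Stokes parameters. It is not the paper's code: the function names `stokes_from_angles` and `dolp_aolp` and the `eps` stabilizer are assumptions made for illustration only.

```python
import numpy as np

def stokes_from_angles(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensity images.

    Hypothetical helper: inputs are arrays of equal shape holding the
    intensities measured behind 0, 45, 90 and 135 degree polarizers.
    """
    i0, i45, i90, i135 = (np.asarray(x, dtype=np.float64)
                          for x in (i0, i45, i90, i135))
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical component
    s2 = i45 - i135                     # +45 vs. -45 degree component
    return s0, s1, s2

def dolp_aolp(s0, s1, s2, eps=1e-8):
    """Degree (0..1) and angle (radians, in [0, pi)) of linear polarization."""
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / (s0 + eps)  # eps avoids division by zero
    aolp = np.mod(0.5 * np.arctan2(s2, s1), np.pi)  # AoLP is periodic with period pi
    return dolp, aolp

# Usage: fully polarized light oriented at 45 degrees (Malus's law intensities).
s0, s1, s2 = stokes_from_angles([1.0], [2.0], [1.0], [0.0])
dolp, aolp = dolp_aolp(s0, s1, s2)
```

Because AoLP wraps around at π, two nearly identical polarization states can have numerically distant angle values, which is one reason an intensity-only reconstruction error does not directly reflect polarization accuracy.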

Paper Structure (27 sections, 10 equations, 10 figures, 7 tables)

Introduction
Related Works
Video Frame Interpolation
Phase-based methods
Flow-based methods
Kernel-based methods
Vision Transformer
Shape from Polarization
Proposed Method
Polarization Imaging Mechanism
Structure of the neural network
Swin-VFI
Video Frame Interpolation for Polarized Video
PVFI-Mono Dataset
Loss Function
...and 12 more sections

Figures (10)

Figure 1: The necessity and challenge of VFI for polarization. (a) The intensity of polarized light is much weaker after passing through the micro-polarizer array of a DoFP polarimeter. (b) Upper: AoLP-DoLP visualization, where AoLP and DoLP are mapped to hue and brightness, respectively. Lower: Variation of AoLP induced by changes in the shooting perspective, posing a challenge for polarized video frame interpolation. (Note the change in the polarizer's AoLP, indicated by the arrow, during the rotation process.)
Figure 2: (a) A naive expansion of Swin-Transformer to spatial-temporal space. (b) A simple illustration of Swin-VFI, where the boundaries of the dimensions are connected, and the cubes with the same color are merged and masked after being shifted.
Figure 3: The overall pipeline of Swin-VFI. (a) Multi-stage Architecture. (b) Multi-scale Transformer. (c) An illustration of Swin Transformer blocks, which contain two successive Multi-head Self-Attention blocks. (d) Feed Forward Network. (e) Brief explanation of Multi-head Self-Attention.
Figure 4: (a) The capture scene of the PVFI-Mono dataset. (b) The shooting target in the rotation scenario. (c) The shooting target in the translation scenario.
Figure 5: (a) Captured images $\mathrm{I}_{0}, \mathrm{I}_{1}$ and their corresponding AoLP and DoLP visualizations $\mathrm{A}_{0}, \mathrm{A}_{1}$. (The red arrow points to the same polarizer.) (b) Visualizations of AoLP and DoLP obtained from the pretrained models provided by FLAVR (Kalluri et al., 2023), VFIT-S and VFIT-B (Shi et al., 2022), as well as from our proposed Swin-VFI method. (c) Visualizations of AoLP and DoLP $\mathrm{A}_{0.5}$ for the intermediate frame $\mathrm{I}_{0.5}$ between $\mathrm{I}_{0}$ and $\mathrm{I}_{1}$.
...and 5 more figures
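
The AoLP-DoLP visualization described in the Figure 1 caption (AoLP mapped to hue, DoLP mapped to brightness) can be sketched as below. This is an illustrative reconstruction, not the authors' rendering code: the function name `aolp_dolp_to_rgb`, the fixed saturation of 1, and the choice to map the AoLP range [0, π) onto the full hue circle are all assumptions.

```python
import numpy as np

def aolp_dolp_to_rgb(aolp, dolp):
    """Map AoLP (radians) to hue and DoLP (0..1) to brightness.

    Hypothetical helper. Saturation is fixed at 1, so the HSV -> RGB
    conversion reduces to a closed-form expression in NumPy.
    Returns an array of shape aolp.shape + (3,) with values in [0, 1].
    """
    aolp = np.asarray(aolp, dtype=np.float64)
    dolp = np.asarray(dolp, dtype=np.float64)
    hue = np.mod(aolp, np.pi) / np.pi   # AoLP has period pi -> hue in [0, 1)
    value = np.clip(dolp, 0.0, 1.0)     # DoLP controls brightness
    # Per-channel offsets (R, G, B) on the six-sector hue wheel.
    k = np.mod(hue[..., None] * 6.0 + np.array([0.0, 4.0, 2.0]), 6.0)
    rgb = value[..., None] * np.clip(np.abs(k - 3.0) - 1.0, 0.0, 1.0)
    return rgb

# Usage: a 1x2 image with a strongly polarized pixel at 0 rad and a
# half-polarized pixel at pi/3 rad.
image = aolp_dolp_to_rgb([[0.0, np.pi / 3]], [[1.0, 0.5]])
```

Under this mapping, unpolarized regions render as black, and a rotation of the polarization angle appears as a change of color, which is how the AoLP variation highlighted in Figure 1(b) becomes visible.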
