UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation

Yunfan LU, Guoqiang Liang, Yusheng Wang, Lin Wang, Hui Xiong

TL;DR

This work addresses the recovery of high-frame-rate global shutter (GS) frames from a rolling shutter (RS) blur frame captured under motion, using paired events. It introduces UniINR, a one-stage framework that unifies RS correction, deblurring, and interpolation through three components: a Spatial-Temporal Implicit Encoding (STE) that produces a spatial-temporal representation (STR), an Exposure Time Embedding (ETE) that injects per-pixel exposure time into a temporal tensor $T$, and a Pixel-by-pixel Decoder (PPD) that renders frames from $T$ and $\theta$ via $I=f_d(T,\theta)=f_{mlp}^{\circlearrowright^5}(T\oplus\theta)$. The model is lightweight (0.379M parameters) and runs at 2.83 ms per frame for 31× interpolation, outperforming state-of-the-art methods on simulated and real datasets. Training is guided by an RS blur consistency loss $\mathcal{L}_b$ and a GS frame reconstruction loss $\mathcal{L}_{re}$. Ablation studies confirm the value of the learned exposure-time embedding and show improved robustness to complex motion. The code is publicly available, which supports reproducibility.
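The decoding step $I=f_d(T,\theta)$ can be illustrated with a minimal NumPy sketch of a five-layer MLP applied independently to every pixel of $T$. This is a reading of the formula under stated assumptions, not the released implementation: $\theta$ is treated here as the decoder's learnable weights, the combination $\oplus$ is not modelled separately, and the layer widths, the ReLU activation, and the helper names `init_params` and `pixelwise_mlp_decoder` are hypothetical.

```python
import numpy as np


def init_params(dims, rng):
    """Random weights for a pointwise MLP; len(dims) - 1 layers (assumed widths)."""
    weights = [rng.standard_normal((i, o)) * np.sqrt(2.0 / i)
               for i, o in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(o) for o in dims[1:]]
    return weights, biases


def pixelwise_mlp_decoder(T, weights, biases):
    """Apply the same MLP to each pixel's feature vector.

    T       : (H, W, C) temporal tensor (features with exposure time embedded).
    weights : list of matrices, weights[i] has shape (C_in_i, C_out_i).
    biases  : list of vectors,  biases[i] has shape (C_out_i,).
    Returns : (H, W, C_out_last) array, e.g. an RGB frame when C_out_last == 3.
    """
    x = T
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        # Matmul over the channel axis only, so pixels never mix (like a 1x1 conv).
        x = x @ W + b
        if i < last:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers (assumed activation)
    return x


rng = np.random.default_rng(0)
dims = [8, 16, 16, 16, 16, 3]            # 5 layers: 8 feature channels -> RGB
weights, biases = init_params(dims, rng)
T = rng.standard_normal((4, 6, 8))       # toy 4x6 temporal tensor
frame = pixelwise_mlp_decoder(T, weights, biases)
print(frame.shape)
```

Because each output pixel depends only on the feature vector at the same location, frames at different query times can be rendered by changing $T$ alone while reusing the same decoder weights.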

Abstract

Video frames captured by rolling shutter (RS) cameras during fast camera movement frequently exhibit RS distortion and blur simultaneously. Recovering high-frame-rate, sharp global shutter (GS) frames from an RS blur frame therefore requires RS correction, deblurring, and frame interpolation to be handled together. A naive solution is to decompose the process into separate tasks and cascade existing methods; however, this leads to cumulative errors and noticeable artifacts. Event cameras offer many advantages, e.g., high temporal resolution, which makes them well suited to this problem. To this end, we propose the first approach, named UniINR, to recover sharp GS frames at an arbitrary frame rate from an RS blur frame and paired events. Our key idea is a unified spatial-temporal implicit neural representation (INR) that directly maps position and time coordinates to color values, addressing the interlocking degradations. Specifically, we introduce spatial-temporal implicit encoding (STE) to convert an RS blur image and events into a spatial-temporal representation (STR). To query a specific sharp frame (GS or RS), we embed the exposure time into the STR and decode the embedded features pixel-by-pixel to recover a sharp frame. Our method features a lightweight model with only 0.38M parameters and high inference efficiency, achieving 2.83 ms/frame for 31× frame interpolation of an RS blur frame. Extensive experiments show that our method significantly outperforms prior methods. Code is available at https://github.com/yunfanLu/UniINR.

Paper Structure (24 sections, 7 equations, 19 figures, 7 tables)

Figures (19)

  • Figure 1: Inputs and outputs of our method, EvUnRoll [zhou2022evunroll], and EvUnRoll [zhou2022evunroll] + TimeLens [tulyakov2021time]. (a) shows the inputs: an RS blur frame and events. $t_s$ and $t_e$ are the start and end timestamps of RS, and $t_{exp}$ is the exposure time. (b) shows our outputs: a sequence of GS sharp frames spanning the whole exposure time ($t_s$, $t_e+t_{exp}$) of the RS blur image. (c) shows the outputs of EvUnRoll, which can only recover GS sharp frames in a limited time interval (red interval) rather than over the whole exposure time of the RS blur frame. (d) shows the outputs of the cascaded EvUnRoll+TimeLens pipeline. More details are in Supp. Mat.
  • Figure 2: An overview of our framework, which consists of three parts: (a) the Spatial-Temporal Implicit Encoding (STE), (b) Exposure Time Embedding (ETE), and (c) Pixel-by-pixel decoding (PPD). Details of STE, ETE, and PPD are described in their respective sections. The inputs are an RS blur frame $I_{rsb}$ and events, and the outputs are a sequence of GS frames and RS frames. RS frames are predicted only in training.
  • Figure 3: Visual results for RS correction on the Fastec-Orig [liu2020deep] dataset.
  • Figure 4: Visual results for RS correction + deblurring on the Gev-Orig [zhou2022evunroll] dataset.
  • Figure 5: Quantitative comparison for RS correction + deblurring on the Gev-Orig [zhou2022evunroll] dataset. The numerical results of DSUN, JCD, and EvUnRoll are provided by [zhou2022evunroll].
  • ...and 14 more figures
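
The timing notation in the Figure 1 caption ($t_s$, $t_e$, $t_{exp}$) can be made concrete with a small sketch of per-pixel time maps. It assumes a linear row-by-row readout, in which the first row starts exposing at $t_s$ and the last row at $t_e$; this is the usual rolling shutter model but is an assumption here, and the function names are hypothetical. A GS query uses a single timestamp for all pixels, whereas an RS query uses a timestamp that depends on the row.

```python
import numpy as np


def gs_time_map(t, height, width):
    """Global shutter: every pixel is sampled at the same instant t."""
    return np.full((height, width), float(t))


def rs_time_map(t_s, t_e, height, width, offset=0.0):
    """Rolling shutter: row y is sampled at its start time plus `offset`.

    Row start times are spaced linearly from t_s (first row) to t_e (last row),
    an assumed readout model. `offset` in [0, t_exp] selects an instant inside
    each row's exposure window.
    """
    row_start = np.linspace(t_s, t_e, height)            # shape (H,)
    return np.repeat(row_start[:, None] + offset, width, axis=1)


def gs_query_times(t_s, t_e, t_exp, n_frames):
    """Evenly spaced GS timestamps over the whole exposure span [t_s, t_e + t_exp]."""
    return np.linspace(t_s, t_e + t_exp, n_frames)


rs = rs_time_map(t_s=0.0, t_e=1.0, height=5, width=3)
print(rs[:, 0])                                          # start time of each row
times = gs_query_times(t_s=0.0, t_e=1.0, t_exp=0.5, n_frames=31)
print(times[0], times[-1])
```

Maps like these are the kind of per-pixel exposure-time input that an exposure time embedding could consume; how UniINR actually encodes them is described in the paper's ETE section, not here.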