Adapting MIMO video restoration networks to low latency constraints

Valéry Dewil; Zhe Zheng; Arnaud Barral; Lara Raad; Nao Nicolas; Ioannis Cassagne; Jean-michel Morel; Gabriele Facciolo; Bruno Galerne; Pablo Arias

TL;DR

This work addresses the challenge of restoring high-quality video under low-latency constraints with MIMO architectures, where the limited number of future frames reduces the temporal receptive field and introduces step-like artifacts at stack boundaries. It proposes two simple, architecture-agnostic fixes, recurrence across MIMO stacks and overlapping output stacks (ROSO), and demonstrates their effectiveness on three state-of-the-art video denoising networks (BasicVSR++, ReMoNet, M2Mnet), improving both reconstruction quality and temporal consistency. The authors provide quantitative and qualitative analyses, including a new benchmark of drone footage that stresses temporal stability, and show that the proposed methods yield state-of-the-art performance for low-latency video denoising. Overall, the approach offers practical improvements for real-time video restoration and highlights temporal-consistency issues that standard benchmarks do not reveal.

Abstract

MIMO (multiple input, multiple output) approaches are a recent trend in neural network architectures for video restoration problems, where each network evaluation produces multiple output frames. The video is split into non-overlapping stacks of frames that are processed independently, resulting in a very appealing trade-off between output quality and computational cost. In this work we focus on the low-latency setting by limiting the number of available future frames. We find that MIMO architectures suffer from two problems that have received little attention so far: (1) the performance drops significantly due to the reduced temporal receptive field, particularly for frames at the borders of the stack, and (2) there are strong temporal discontinuities at stack transitions, which induce a step-wise motion artifact. We propose two simple solutions to alleviate these problems: recurrence across MIMO stacks, which boosts the output quality by implicitly increasing the temporal receptive field, and overlapping of the output stacks, which smooths the temporal discontinuity at stack transitions. These modifications can be applied to any MIMO architecture. We test them on three state-of-the-art video denoising networks with different computational costs. The proposed contributions result in a new state of the art for low-latency networks, both in terms of reconstruction error and temporal consistency. As an additional contribution, we introduce a new benchmark consisting of drone footage that highlights temporal consistency issues that are not apparent in the standard benchmarks.
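The stack-based processing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `network` callable, its `(stack, state) -> (output, state)` interface, the uniform averaging of overlapping outputs and the names `stack_spans`, `restore` and `toy_network` are all assumptions made for illustration. Setting `stride == stack_size` gives non-overlapping stacks, and a smaller stride makes consecutive output stacks overlap.

```python
import numpy as np


def stack_spans(num_frames, stack_size, stride):
    """Return (start, end) frame spans; stride == stack_size means no overlap."""
    starts = range(0, max(num_frames - stack_size, 0) + 1, stride)
    return [(s, s + stack_size) for s in starts]


def restore(video, network, stack_size, stride):
    """Run a MIMO-style network stack by stack over a (T, H, W) video.

    `network` is a hypothetical callable (stack, state) -> (output, state).
    The state returned for one stack is fed to the next one, which is one
    simple way to express recurrence across stacks. Frames covered by
    several stacks are averaged with uniform weights (an assumption).
    """
    num_frames = video.shape[0]
    acc = np.zeros(video.shape, dtype=np.float64)
    count = np.zeros(num_frames, dtype=np.float64)
    state = None  # recurrent state carried from one stack to the next
    for start, end in stack_spans(num_frames, stack_size, stride):
        out, state = network(video[start:end], state)
        acc[start:end] += out
        count[start:end] += 1
    # Frames not covered by any stack stay at zero in this sketch;
    # a real pipeline would pad the video or add a final stack.
    count = np.maximum(count, 1)
    return acc / count[:, None, None]


def toy_network(stack, state):
    """Stand-in for a real network: returns its input and counts the stacks seen."""
    seen = 0 if state is None else state
    return stack.astype(np.float64), seen + 1


video = np.arange(9 * 2 * 2, dtype=np.float64).reshape(9, 2, 2)
restored = restore(video, toy_network, stack_size=5, stride=4)
```

With 9 frames, a stack size of 5 and a stride of 4, the two stacks cover frames 0 to 4 and 4 to 8, so frame 4 is produced twice and averaged. How the overlap is sized and combined in the paper is not specified in this summary.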

Paper Structure (15 sections, 5 figures, 5 tables)

Videos for qualitative evaluation
Networks and training details
BasicVSR++ chan2022basicvsr++
Training details.
ReMoNet xiang2022remonet
Training details.
M2Mnet chen2021multiframe
Training details.
Datasets
Additional quantitative results
PSNR/SSIM.
Overlapped vs previous output RAS.
PSNR-runtime landscape.
PSNR and temporal consistency profile.
Additional visual results

Figures (5)

Figure 1: Training strategy used for ReMoNet. The RTF is recurrent. The MOA takes the outputs of 5 RTFs, which require a total of 7 frames to be produced. Instead, we load 9 frames and compute 7 RTF outputs. At random during training we select a number of warm-up RTFs (0, 1 or 2) whose outputs are discarded and use the following 5 RTF outputs. In this way we train the MOA on RTFs with different warm-ups.
Figure 2: Landscape of video denoising networks derived from the results presented in the main papers. The vertical axis shows the PSNR averaged over noise levels $\sigma=10,20,30,40,50$. The horizontal axis shows the running time per frame, measured on an NVIDIA A100 GPU. The methods plotted with a star are variants proposed in this paper. As a reference, we added in light gray the methods that take the full video as input.
Figure 3: Per-frame PSNR and TC for BasicVSR++ and ReMoNet. For each frame we show the average PSNR on the drone benchmark for AWGN denoising with $\sigma=30$. These curves show the same behaviour as that of M2Mnet, shown in Figure 1 of the main paper.
Figure 4: Results of M2Mnet, BasicVSR++ and ReMoNet at a stack transition. For each network, we compare the baseline with the recurrent version (+ RAS) and the proposed method (+ ROSO). We show the last two frames of one stack and the first frame of the next stack. Between them, we display the difference (after alignment). For visualization purposes, the contrast has been enhanced. For the difference images, we map the range $[-27.7, 20]$ to $[0, 255]$.
Figure 5: Results of M2Mnet, BasicVSR++ and ReMoNet at a stack transition. For each network, we compare the baseline with the recurrent version (+ RAS) and the proposed method (+ ROSO). We show the last two frames of one stack and the first frame of the next stack. Between them, we display the difference (after alignment). For visualization purposes, the contrast has been enhanced. For the difference images, we map the range $[-27.7, 20]$ to $[0, 255]$.
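
The mapping of difference images from $[-27.7, 20]$ to $[0, 255]$ mentioned in the captions of Figures 4 and 5 can be sketched as an affine rescaling. This is an illustration only: the function name `difference_to_uint8`, the clipping of out-of-range values and the rounding to the nearest integer are assumptions, and the frame alignment step is not shown.

```python
import numpy as np


def difference_to_uint8(diff, lo=-27.7, hi=20.0):
    """Affinely map difference values from [lo, hi] to [0, 255] for display.

    Values outside [lo, hi] are clipped (an assumption; the captions only
    state the range that is mapped).
    """
    diff = np.asarray(diff, dtype=np.float64)
    scaled = (diff - lo) / (hi - lo) * 255.0
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


example = difference_to_uint8(np.array([-100.0, -27.7, 20.0, 100.0]))
```

Because the range is not symmetric around zero, a zero difference does not land on mid-gray under this mapping.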
