Adapting MIMO video restoration networks to low latency constraints
Valéry Dewil, Zhe Zheng, Arnaud Barral, Lara Raad, Nao Nicolas, Ioannis Cassagne, Jean-Michel Morel, Gabriele Facciolo, Bruno Galerne, Pablo Arias
TL;DR
This work addresses high-quality video restoration under low-latency constraints with MIMO architectures, where limiting the number of future frames reduces the temporal receptive field and introduces step-like artifacts at stack boundaries. It proposes two simple, architecture-agnostic fixes, recurrence across MIMO stacks and overlapping output stacks (ROSO), and demonstrates their effectiveness on three state-of-the-art video denoising networks (BasicVSR++, ReMoNet, M2Mnet), improving both reconstruction quality and temporal consistency. The authors provide extensive quantitative and qualitative analyses, including a new benchmark of drone footage designed to stress temporal stability, and show that the proposed methods achieve state-of-the-art performance for low-latency video denoising. Overall, the approach offers practical improvements for real-time video restoration and highlights temporal consistency issues that standard benchmarks do not reveal.
Abstract
MIMO (multiple input, multiple output) approaches are a recent trend in neural network architectures for video restoration problems, where each network evaluation produces multiple output frames. The video is split into non-overlapping stacks of frames that are processed independently, resulting in a very appealing trade-off between output quality and computational cost. In this work we focus on the low-latency setting by limiting the number of available future frames. We find that MIMO architectures suffer from two problems that have received little attention so far: (1) performance drops significantly due to the reduced temporal receptive field, particularly for frames at the borders of the stack, and (2) strong temporal discontinuities at stack transitions induce a step-wise motion artifact. We propose two simple solutions to alleviate these problems: recurrence across MIMO stacks, which boosts the output quality by implicitly increasing the temporal receptive field, and overlapping of the output stacks, which smooths the temporal discontinuity at stack transitions. These modifications can be applied to any MIMO architecture. We test them on three state-of-the-art video denoising networks with different computational costs. The proposed contributions result in a new state of the art for low-latency networks, both in terms of reconstruction error and temporal consistency. As an additional contribution, we introduce a new benchmark consisting of drone footage that highlights temporal consistency issues that are not apparent in the standard benchmarks.
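To make the stack partition and the overlapping idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' implementation: the function names `split_into_stacks` and `blend_overlapping_outputs` are hypothetical, frames are stood in for by single floats, and the linear cross-fade used to merge overlapping outputs is an assumed blending rule that the abstract does not specify.

```python
def split_into_stacks(num_frames, stack_size, overlap=0):
    """Return (start, end) frame-index pairs, end exclusive.

    With overlap=0 this is the usual non-overlapping MIMO partition;
    with overlap>0 consecutive stacks share `overlap` frames.
    """
    if not 0 <= overlap < stack_size:
        raise ValueError("overlap must satisfy 0 <= overlap < stack_size")
    stride = stack_size - overlap
    stacks = []
    start = 0
    while start < num_frames:
        end = min(start + stack_size, num_frames)
        stacks.append((start, end))
        if end == num_frames:
            break
        start += stride
    return stacks


def blend_overlapping_outputs(stacks, outputs, num_frames, overlap):
    """Merge per-stack outputs into one sequence.

    Frames covered by two stacks are combined with a linear cross-fade
    (an assumed rule, chosen only for illustration).
    """
    acc = [0.0] * num_frames
    weight_sum = [0.0] * num_frames
    for (start, end), out in zip(stacks, outputs):
        length = end - start
        for k, value in enumerate(out):
            w = 1.0
            # Ramp up at the head of every stack except the first.
            if start > 0 and k < overlap:
                w = (k + 1) / (overlap + 1)
            # Ramp down at the tail of every stack except the last.
            if end < num_frames and k >= length - overlap:
                w = (length - k) / (overlap + 1)
            acc[start + k] += w * value
            weight_sum[start + k] += w
    return [a / w for a, w in zip(acc, weight_sum)]


# Toy example: each stack outputs a constant level, mimicking the
# step-like jump at stack transitions described in the abstract.
stacks = split_into_stacks(num_frames=8, stack_size=4, overlap=1)
outputs = [[0.0] * 4, [2.0] * 4, [4.0] * 2]
blended = blend_overlapping_outputs(stacks, outputs, num_frames=8, overlap=1)
```

In this toy example the shared frames take an intermediate value between the two stack levels, so the jump at each transition is spread over the overlap instead of occurring in a single step. Recurrence across stacks is not shown here; it would amount to passing a state computed on one stack as an extra input when processing the next.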
