Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks
Matthew Dutson, Nathan Labiosa, Yin Li, Mohit Gupta
TL;DR
The paper tackles temporal instability in video inference by introducing universal, lightweight stabilization adapters that plug into pre-trained frame-based models without altering them. A unified accuracy-stability-robustness loss guides training, and theoretical bounds on this loss (oracle and collapse) identify when training avoids over-smoothing or collapse. The approach combines EMA-based stabilizers, stabilization controllers, and optional spatial fusion to smooth features and outputs across diverse tasks, improving robustness to time-varying and transient corruptions. Experiments on image enhancement, denoising, depth estimation, and segmentation show improved temporal consistency and resilience to corruptions and adverse conditions, which is useful for real-world video processing pipelines.
Abstract
When applied sequentially to video, frame-based networks often exhibit temporal inconsistency, for example outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions under which it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks, including denoising (NAFNet), image enhancement (HDRNet), monocular depth estimation (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.
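To make the idea of an EMA-based stabilizer concrete, the following is a minimal sketch of an exponential moving average applied to per-frame feature vectors. It is an illustration under stated assumptions, not the paper's implementation: the class name `EMAStabilizer`, the fixed decay `beta`, and the helper `temporal_variation` are all hypothetical, and the learned stabilization controllers and spatial fusion mentioned above are omitted.

```python
from typing import List, Optional, Sequence


class EMAStabilizer:
    """Exponential moving average over per-frame feature vectors.

    Hypothetical illustration: `beta` is an assumed fixed decay. The paper
    describes learned stabilization controllers, which are not modeled here.
    """

    def __init__(self, beta: float) -> None:
        if not 0.0 <= beta < 1.0:
            raise ValueError("beta must be in [0, 1)")
        self.beta = beta
        self.state: Optional[List[float]] = None  # running average, empty before frame 0

    def step(self, features: Sequence[float]) -> List[float]:
        """Blend the current frame's features with the running average."""
        if self.state is None:
            # First frame: nothing to average with, so pass it through.
            self.state = list(features)
        else:
            self.state = [
                self.beta * s + (1.0 - self.beta) * x
                for s, x in zip(self.state, features)
            ]
        return list(self.state)


def temporal_variation(frames: Sequence[Sequence[float]]) -> float:
    """Sum of absolute frame-to-frame changes (a simple flicker measure)."""
    return sum(
        abs(b_i - a_i)
        for a, b in zip(frames, frames[1:])
        for a_i, b_i in zip(a, b)
    )


# A one-dimensional "feature" that flickers between 1 and 0 on every frame.
raw = [[1.0], [0.0], [1.0], [0.0]]
stabilizer = EMAStabilizer(beta=0.5)
smoothed = [stabilizer.step(frame) for frame in raw]
```

With `beta = 0.5`, the flickering input is damped rather than removed. A larger `beta` smooths more strongly but also lags behind genuine scene changes, which is the over-smoothing risk that the accuracy-stability trade-off in the loss is meant to control.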
