Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks

Matthew Dutson, Nathan Labiosa, Yin Li, Mohit Gupta

TL;DR

The paper tackles temporal instability in video inference by introducing universal, lightweight stabilization adapters that plug into pre-trained frame-based models without altering them. A unified accuracy-stability-robustness loss guides training, with theoretical bounds (oracle and collapse) to prevent over-smoothing or collapse. The approach uses EMA-based stabilizers, stabilization controllers, and optional spatial fusion to stabilize features and outputs across diverse tasks, improving robustness to temporal and transient corruptions. Empirical results across image enhancement, denoising, depth estimation, and segmentation demonstrate improved temporal consistency and resilience to corruptions and adverse conditions, with practical benefits for real-world video processing pipelines.

Abstract

When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.
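To make the idea of an EMA-based stabilizer (mentioned in the TL;DR) concrete, here is a minimal sketch rather than the authors' implementation. It assumes the textbook exponential moving average update, $z_t = \beta z_{t-1} + (1 - \beta) x_t$, with a fixed decay $\beta$. The names `EMAStabilizer` and `total_variation` are hypothetical, and frames are plain lists of floats standing in for features or outputs of a frozen base network.

```python
class EMAStabilizer:
    """Exponential moving average over a stream of per-frame values.

    Assumed update (standard EMA, not taken from the paper):
        z_t = beta * z_{t-1} + (1 - beta) * x_t
    """

    def __init__(self, beta):
        if not 0.0 <= beta < 1.0:
            raise ValueError("beta must be in [0, 1)")
        self.beta = beta
        self.state = None  # no history before the first frame

    def step(self, x):
        if self.state is None:
            # First frame: nothing to blend with, so pass it through.
            self.state = list(x)
        else:
            # Blend the previous state with the new frame, element-wise.
            self.state = [
                self.beta * s + (1.0 - self.beta) * v
                for s, v in zip(self.state, x)
            ]
        return list(self.state)


def total_variation(frames):
    """Sum of absolute frame-to-frame differences (a simple instability proxy)."""
    return sum(
        abs(a - b)
        for prev, cur in zip(frames, frames[1:])
        for a, b in zip(prev, cur)
    )


# A one-element "frame" that flickers between 0 and 1.
flicker = [[0.0], [1.0], [0.0], [1.0], [0.0]]
stabilizer = EMAStabilizer(beta=0.5)
smoothed = [stabilizer.step(frame) for frame in flicker]

print(total_variation(flicker))   # 4.0
print(total_variation(smoothed))  # 1.4375
```

A larger `beta` smooths more aggressively but reacts more slowly to genuine scene changes, which is one motivation for predicting the decay per layer rather than fixing it by hand.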

Paper Structure

This paper contains 37 sections, 26 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Stabilizing image-based networks. (top) Applying single-image models sequentially to the frames of a video can cause unstable predictions and failures under time-varying corruptions. In the top-right example, we see that randomly dropping patches causes artifacts in monocular depth estimates. (middle) We propose a method for injecting stabilizers into existing networks and for training these stabilizers using a unified accuracy-stability-robustness loss. (bottom) We demonstrate improvements in stability and robustness for various tasks, without modifying the original image-based models. Image sources: gharbi_2017_deep_bilateral, richter_2017_playing_for, schmalfuss_2025_robustspring_benchmarking, yang_2024_depth-anything-v2.
  • Figure 2: Unified loss for one-dimensional predictions. We consider a time series of duration $\tau = 3$ consisting of one-dimensional predictions, with $\delta$ defined as the L1 distance. We assume that the first prediction $\hat{y}_1$ is fixed (cannot be modified by a stabilization adapter). We show the value of the second prediction $\hat{y}_2$ along the x-axis and the value of the third $\hat{y}_3$ along the y-axis, with contours indicating the value of the unified loss as these predictions vary. When $\lambda = 0$, the minimum occurs at the ground truth $\hat{y}_2 = y_2$ and $\hat{y}_3 = y_3$. When $\lambda$ is nonzero but below the oracle bound, the minimum still occurs at the ground truth, but the loss increases more slowly in the direction of stabler predictions. When $\lambda$ exceeds the collapse bound, the global minimum is the collapse state $\hat{y}_3 = \hat{y}_2 = y_1$.
  • Figure 3: Stabilization controllers. Starting with the existing network (red), we add stabilizers (yellow) to select layers. The degree of stabilization, i.e., the decay $\beta$, can be predicted by a stabilization controller (blue). This controller consists of a shared backbone $g$ and one head $h_i$ per stabilized layer. Stabilizers can be added to both internal layers and the model output.
  • Figure 4: Image enhancement results. Introducing a controller and spatial fusion to the stabilizer significantly improves the accuracy-stability tradeoff. The spatial-fusion stabilizer reduces frame-to-frame variation by up to $\approx 35\%$ while exceeding the quality of the base model. "Instability" here refers to negative stability ($-\mathcal{S}$); see Equation \ref{eq:stability}. The goal is to move toward $-x$ (lower instability) and $+y$ (better image quality).
  • Figure 5: Denoising results. Because it attempts to stabilize an iid noise residual, naive feature-space stabilization leads to worse PSNR and worse stability. We achieve the best performance with a controlled stabilizer, usually with spatial fusion.
  • ...and 9 more figures
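
The one-dimensional setting in the Figure 2 caption can be explored numerically. The sketch below is illustrative only: the exact form of the unified loss is not given in this summary, so it assumes an accuracy term plus $\lambda$ times a stability term, both using the L1 distance. The function names, the ground-truth series, the search grid, and the chosen $\lambda$ values are all hypothetical, and the $\lambda$ values are not claimed to equal the oracle or collapse bounds.

```python
from itertools import product


def unified_loss(y_hat, y, lam):
    """Assumed loss: L1 accuracy term + lam * L1 frame-to-frame stability term."""
    accuracy = sum(abs(p - t) for p, t in zip(y_hat, y))
    instability = sum(abs(b - a) for a, b in zip(y_hat, y_hat[1:]))
    return accuracy + lam * instability


def best_predictions(y, lam, grid):
    """Grid-search y_hat_2 and y_hat_3, keeping y_hat_1 fixed at y_1."""
    return min(
        product(grid, grid),
        key=lambda pair: unified_loss([y[0], *pair], y, lam),
    )


# Hypothetical ground truth for a series of duration tau = 3.
y = [0.0, 1.0, 2.0]
# Candidate values from -1 to 3 in steps of 0.25 (exact in binary floating point).
grid = [i / 4 for i in range(-4, 13)]

print(best_predictions(y, lam=0.0, grid=grid))   # (1.0, 2.0): ground truth
print(best_predictions(y, lam=0.25, grid=grid))  # (1.0, 2.0): still ground truth
print(best_predictions(y, lam=5.0, grid=grid))   # (0.0, 0.0): collapse onto y_1
```

Under these assumptions the behavior matches the caption qualitatively: a small $\lambda$ leaves the minimum at the ground truth, while a large $\lambda$ makes the constant sequence $\hat{y}_3 = \hat{y}_2 = y_1$ the global minimum.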