Smooth regularization for efficient video recognition

Gil Goldman, Raja Giryes, Mahadev Satyanarayanan

TL;DR

The paper tackles efficient video action recognition by adding a temporal regularizer that makes frame embeddings evolve smoothly. The core method, Gaussian Random Walk (GRW) smoothing, combines a frame-order contrastive loss with a smoothness prior to penalize high accelerations in frame embeddings, culminating in the objective $\mathcal{L}_{CE} + \lambda \mathcal{L}_{smooth}$. Applied to lightweight backbones such as MoViNet and MobileNetV3 on Kinetics-600/400, GRW improves Top-1 accuracy by $3.8\%$ to $6.4\%$, setting a new state of the art under comparable compute and memory budgets. The approach is plug-and-play with low overhead, and it prompts future work on broader architectures, dynamic windows, and a deeper theoretical understanding of GRW's impact on optimization and representation learning.

Abstract

We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
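The idea of penalizing high acceleration in frame embeddings can be illustrated with a minimal NumPy sketch. This is not the paper's GRW loss, whose exact form is not given in this summary; it substitutes a plain squared second-order difference as a stand-in for $\mathcal{L}_{smooth}$. The function names `acceleration_penalty` and `total_loss` and the default `lam=0.1` are hypothetical choices made for illustration.

```python
import numpy as np

def acceleration_penalty(z):
    """Mean squared second-order difference of frame embeddings.

    z: array of shape (T, D), one D-dimensional embedding per frame, T >= 3.
    NOTE: a simple stand-in for a smoothness term, not the paper's GRW loss.
    """
    z = np.asarray(z, dtype=float)
    if z.ndim != 2 or z.shape[0] < 3:
        raise ValueError("expected shape (T, D) with T >= 3")
    # Discrete acceleration at frame t: z[t+1] - 2*z[t] + z[t-1]
    accel = z[2:] - 2.0 * z[1:-1] + z[:-2]
    # Squared norm per frame, averaged over the T-2 interior frames
    return float(np.mean(np.sum(accel ** 2, axis=1)))

def total_loss(ce_loss, z, lam=0.1):
    """Illustrative combined objective L_CE + lam * L_smooth (lam is assumed)."""
    return ce_loss + lam * acceleration_penalty(z)
```

Under this stand-in, an embedding trajectory that moves at constant velocity incurs zero penalty, while one that changes direction or speed between frames is penalized in proportion to the squared change.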

Paper Structure

This paper contains 18 sections, 3 theorems, 28 equations, 4 figures, 7 tables.

Key Result

Theorem 1

Given $T\ge 3$, let $Z^* = \arg\min_{Z\in \mathcal{Z}^T} \mathcal{L}(Z)$. Then

Figures (4)

  • Figure 1: Performance Results on Kinetics-600. By simply adding GRW-smoothing to existing models, we achieve significant improvements. Left: Accuracy vs. FLOPs, where each point corresponds to a published model (see Table \ref{tab:k600.flops} for references). GRW-smoothing improves the state-of-the-art performance of efficient models by 3.8–6.1%. Notably, MoViNet-A3-GRW achieves 85.6% accuracy at just 56.4 GFLOPs, while the closest model, MViTv2-B-32×3, requires 18.3$\times$ more FLOPs. Right: Accuracy vs. Memory. GRW-smoothing improves the state-of-the-art performance of memory-efficient models by 4.9–6.4%.
  • Figure 2: Warm-up Example. Top: The Airplanes dataset contains 1,000 training and 100 test short videos of model airplanes, each performing one of three rotations starting from a random position. The dataset isolates temporal classification, since any single frame is independent of the rotation label. Bottom: Output embeddings of two identical models trained with and without the smoothness term. Typical clip embeddings for Yaw, Pitch and Roll are shown in green, blue and red, respectively, projected onto the first two principal components of the embedded test set. Each point is a single frame embedding, and the index next to it is its frame index within the clip.
  • Figure 3: Intermediate layer smoothing. The encodings $\tilde{Z}$ are global-pooled along the spatial dimensions, then normalized across the batch dimension using batch normalization (BN) without learnable parameters. The sub-clips $Z^c$ are fed into $\textrm{GRW}$.
  • Figure 4: Final layer smoothing. Output encodings $\varphi(X)=\tilde{Z}$ of a given video model are mapped to $Z$ by an affine transformation. The sub-clips $Z^c=(\boldsymbol{\mathrm{z}}_{cT}, \dots, \boldsymbol{\mathrm{z}}_{(c+1)T-1})$ are fed into the $\textrm{GRW}$ regularization as an additional loss term, and are then further processed by a few Attention layers.
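The preprocessing described in the captions of Figures 3 and 4 (spatial global pooling, parameter-free batch normalization, and splitting into sub-clips $Z^c=(\mathrm{z}_{cT}, \dots, \mathrm{z}_{(c+1)T-1})$) can be sketched in NumPy as follows. Several details here are assumptions rather than facts from the paper: the `(B, N, C, H, W)` feature layout, computing normalization statistics per channel over both the batch and frame axes, the `eps` value, dropping trailing frames that do not fill a whole sub-clip, and the function names themselves.

```python
import numpy as np

def pool_and_normalize(feats, eps=1e-5):
    """Global-pool spatial dims, then normalize per channel without
    learnable scale or shift.

    feats: array of shape (B, N, C, H, W) for B clips of N frames (assumed layout).
    Returns an array of shape (B, N, C).
    """
    z = feats.mean(axis=(3, 4))                      # spatial global pooling
    # Assumption: statistics are taken over batch and frame axes per channel
    mean = z.mean(axis=(0, 1), keepdims=True)
    var = z.var(axis=(0, 1), keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def split_subclips(z, T):
    """Split (B, N, C) embeddings into non-overlapping sub-clips of length T.

    Sub-clip c holds frames cT, ..., (c+1)T-1. Trailing frames that do not
    fill a whole sub-clip are dropped (an assumption of this sketch).
    Returns an array of shape (B, N // T, T, C).
    """
    B, N, C = z.shape
    n_sub = N // T
    return z[:, : n_sub * T].reshape(B, n_sub, T, C)
```

For example, with `N = 8` frames and `T = 4`, each clip yields two sub-clips covering frames 0 to 3 and 4 to 7, and each sub-clip would then be passed to the regularizer independently.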

Theorems & Definitions (8)

  • Remark 1
  • Theorem 1: $GRW$-smoothing scale
  • Remark 2
  • proof
  • Proposition 1: Uniform Lower Bound on $\mathcal{L}$
  • Proposition 2: Uniform Configuration Upper Bound
  • proof: Proof of Proposition 1
  • proof: Proof of Proposition 2