Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks

Matthew Kowal, Mennatullah Siam, Md Amirul Islam, Neil D. B. Bruce, Richard P. Wildes, Konstantinos G. Derpanis

TL;DR

The paper addresses the limited understanding of what static versus dynamic information deep spatiotemporal networks encode in their intermediate representations. It introduces a general, sampling-based mutual-information framework that quantifies static and dynamic biases at the layer and unit levels, and applies it to action recognition, automatic video object segmentation (AVOS), and video instance segmentation (VIS). Key contributions include a unified bias metric with a per-channel classification, the StaticDropout debiasing method, a systematic study of how architectures, datasets, and training dynamics shape these biases, and architectural guidance for encoding more dynamics. The findings reveal pervasive static bias across models; two-stream cross connections and carefully chosen datasets yield more dynamic representations and improved performance on dynamics-centric tasks.

Abstract

There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation compared to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks: action recognition, automatic video object segmentation (AVOS), and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static, dynamic, or a combination of the two. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross connection layers compared with previous architectures.
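
To make the idea of a per-channel bias concrete, the following is a minimal sketch, not the authors' estimator. It assumes that features have already been extracted for video pairs sharing static information and for pairs sharing dynamic information, and that each channel pair is jointly Gaussian, so that mutual information reduces to the closed form $-\tfrac{1}{2}\log(1-\rho^2)$. The function names `gaussian_mi` and `label_channels`, and the simple "larger MI wins" rule, are hypothetical choices made for illustration; the paper's own metric and its joint and residual unit categories are not reproduced here.

```python
import numpy as np


def gaussian_mi(a, b, eps=1e-8):
    """Per-channel mutual information estimate (in nats) for paired features.

    a, b: arrays of shape (num_pairs, num_channels); row i of `a` and row i
    of `b` come from the two videos of pair i.
    Assumption: each channel pair is jointly Gaussian, so
    MI = -0.5 * log(1 - rho^2), with rho the Pearson correlation.
    """
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    cov = (a * b).mean(axis=0)
    rho = cov / (a.std(axis=0) * b.std(axis=0) + eps)
    # Keep rho strictly inside (-1, 1) so the logarithm stays finite.
    rho = np.clip(rho, -1.0 + 1e-6, 1.0 - 1e-6)
    return -0.5 * np.log(1.0 - rho ** 2)


def label_channels(z_s1, z_s2, z_d1, z_d2):
    """Label each channel by the factor with the larger MI estimate.

    (z_s1, z_s2): features of pairs that share static information.
    (z_d1, z_d2): features of pairs that share dynamic information.
    This two-way rule is a simplification chosen for the sketch.
    """
    mi_static = gaussian_mi(z_s1, z_s2)
    mi_dynamic = gaussian_mi(z_d1, z_d2)
    return np.where(mi_static > mi_dynamic, "static", "dynamic")


# Synthetic demonstration: channels 0-1 agree across static pairs,
# channels 2-3 agree across dynamic pairs.
rng = np.random.default_rng(0)
n = 2000
shared_s = rng.normal(size=(n, 2))
shared_d = rng.normal(size=(n, 2))
z_s1 = np.hstack([shared_s + 0.1 * rng.normal(size=(n, 2)), rng.normal(size=(n, 2))])
z_s2 = np.hstack([shared_s + 0.1 * rng.normal(size=(n, 2)), rng.normal(size=(n, 2))])
z_d1 = np.hstack([rng.normal(size=(n, 2)), shared_d + 0.1 * rng.normal(size=(n, 2))])
z_d2 = np.hstack([rng.normal(size=(n, 2)), shared_d + 0.1 * rng.normal(size=(n, 2))])
labels = label_channels(z_s1, z_s2, z_d1, z_d2)
print(labels)
```

In this toy setup a channel is called static when its activations agree more strongly across the static pairs than across the dynamic pairs. Counting the labels per layer gives a rough layer-level summary in the same spirit as the bias scores described above.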

Paper Structure

This paper contains 27 sections, 8 equations, 15 figures, and 1 table.

Figures (15)

  • Figure 1: We introduce a general technique that, given a trained spatiotemporal model and a video dataset, can quantify the bias in any hidden representation within the model toward encoding static (red) or dynamic (blue) information. We use this technique to study action recognition (squares) and video segmentation (diamonds) and explore the effect of models and training datasets on static and dynamic biases.
  • Figure 2: Overview of our methodology for analysing bias toward static or dynamic information. We measure the dynamic and static biases in deep spatiotemporal models for three tasks: action recognition, automatic video object segmentation and video instance segmentation. (1) We sample video pairs that share either static, $(v^S_1, v^S_2)$, or dynamic, $(v^D_1, v^D_2)$, information using video stylization \cite{texler2020interactive} and frame shuffling or optical flow jitter (flow visualized in RGB format). (2) Given a pretrained model, $f_\theta$, we compute the mutual information (MI) between intermediate representations of video pairs, $z^F$, to assess the model's bias toward either factor on a per-layer, $l$, or per-channel (i.e. unit) basis.
  • Figure 3: Layer and unit-wise analyses on action recognition networks trained on Kinetics-400 \cite{carreira2017quo}. Left: Layer-wise encoding of static and dynamic factors using the layer-wise metric (Eq. \ref{eq:biasscores}) for: (a) single stream 3D CNNs, (b) SlowFast variants and (c) transformer variants. SF-Slow and SF-Fast denote the representation taken before the fusion layer from the slow and fast branches, resp. Right: Estimates of the dynamic, static, joint and residual units using the unit-wise metric (Eq. \ref{eq:ind_bias_scores_diff_b}).
  • Figure 4: Layer and unit-wise analysis on off-the-shelf VOS networks. Left: Encoding of dynamic and static factors for motion, appearance streams and fusion layers in FusionSeg \cite{jain2017fusionseg}, MATNet \cite{zhou2020motion} and RTNet \cite{ren2021reciprocal} using the layer-wise metric (Eq. \ref{eq:biasscores}). Fusion layers are mostly biased toward the static factor. Right: Individual units analysis for the three models for fusion layer 5 using the unit-wise metric (Eq. \ref{eq:ind_bias_scores_diff_b}). MATNet has the largest number of dynamic units.
  • Figure 5: Layer and unit-wise analysis on off-the-shelf state-of-the-art VIS models. Left: Encoding of dynamic and static factors for motion, appearance streams and fusion layers in VisTR-R50 and VisTR-101 \cite{wang2021end} using the layer-wise metric (Eq. \ref{eq:biasscores}). All layers are biased toward the static factor. Right: Individual units analysis for the two VisTR variants using the unit-wise metric (Eq. \ref{eq:ind_bias_scores_diff_b}).
  • ...and 10 more figures
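
The pair sampling in the Figure 2 caption mentions frame shuffling. As a rough illustration only, the sketch below builds one video pair by permuting the frame order of a clip: every frame's appearance is preserved, while the temporal order (and hence the motion) is disrupted. The function name `shuffle_frames`, the `(T, H, W, C)` array layout, and the reading that such a pair shares static but not dynamic information are assumptions made for this example; stylization and optical flow jitter are not covered.

```python
import numpy as np


def shuffle_frames(video, rng):
    """Return a frame-shuffled copy of `video` and the permutation used.

    video: array of shape (T, H, W, C), assumed layout for this sketch.
    The permutation is redrawn until it differs from the identity so the
    temporal order actually changes (when T > 1).
    """
    t = video.shape[0]
    perm = rng.permutation(t)
    while t > 1 and np.array_equal(perm, np.arange(t)):
        perm = rng.permutation(t)
    return video[perm], perm


# Toy clip: 8 frames of 2x2 RGB values.
video = np.arange(8 * 2 * 2 * 3).reshape(8, 2, 2, 3)
shuffled, perm = shuffle_frames(video, np.random.default_rng(0))

# (video, shuffled) contain the same set of frames in a different order.
pair = (video, shuffled)
```

Feeding both members of such a pair through a pretrained model and comparing their intermediate features is what allows a per-layer or per-channel agreement measure, such as mutual information, to be attributed to the factor the pair shares.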