Interpreting Video Representations with Spatio-Temporal Sparse Autoencoders

Atahan Dokme, Sriram Vishwanath

Abstract

We present the first systematic study of Sparse Autoencoders (SAEs) on video representations. Standard SAEs decompose video into interpretable, monosemantic features but destroy temporal coherence: hard TopK selection produces unstable feature assignments across frames, reducing autocorrelation by 36%. We propose spatio-temporal contrastive objectives and Matryoshka hierarchical grouping that recover, and even exceed, the temporal coherence of the raw features. The contrastive loss weight controls a tunable trade-off between reconstruction fidelity and temporal coherence. A systematic ablation over two backbones and two datasets shows that different configurations excel at different goals: reconstruction fidelity, temporal coherence, action discrimination, or interpretability. Contrastive SAE features improve action classification by +3.9% over raw features and text-video retrieval R@1 by up to 2.8×. A cross-backbone analysis reveals that standard monosemanticity metrics contain a backbone-alignment artifact: both DINOv2 and VideoMAE produce equally monosemantic features under a neutral (CLIP) similarity metric. Causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features.
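
To make the abstract's moving parts concrete, the following is a minimal PyTorch sketch of a hard-TopK SAE, a temporal contrastive term over consecutive frames, and a lag-1 temporal-autocorrelation statistic. The names (`TopKSAE`, `temporal_contrastive_loss`, `temporal_autocorrelation`), the InfoNCE form of the loss, and the `(T frames, P patches, dim)` tensor layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch, not the paper's code: hard-TopK SAE plus a temporal
# contrastive term and a lag-1 autocorrelation statistic.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)
        self.k = k

    def forward(self, x):
        a = F.relu(self.enc(x))                          # dense pre-activations
        vals, idx = a.topk(self.k, dim=-1)               # hard TopK selection
        z = torch.zeros_like(a).scatter_(-1, idx, vals)  # sparse codes
        return self.dec(z), z                            # reconstruction, codes

def temporal_contrastive_loss(z: torch.Tensor, tau: float = 0.07):
    """z: (T, P, d_dict) codes. Positives pair the same spatial patch in
    consecutive frames (the 'temporal' pairing of Figure 1a)."""
    a = F.normalize(z[:-1].reshape(-1, z.shape[-1]), dim=-1)  # frames 0..T-2
    p = F.normalize(z[1:].reshape(-1, z.shape[-1]), dim=-1)   # frames 1..T-1
    logits = a @ p.t() / tau           # InfoNCE: diagonal entries are positives
    targets = torch.arange(logits.shape[0], device=z.device)
    return F.cross_entropy(logits, targets)

def temporal_autocorrelation(z: torch.Tensor):
    """Lag-1 cosine similarity of codes at the same patch across frames,
    the kind of coherence statistic a hard TopK can degrade."""
    return F.cosine_similarity(z[:-1], z[1:], dim=-1).mean()
```

A combined objective such as `F.mse_loss(recon, x) + lam * temporal_contrastive_loss(z)` would then expose `lam` as the tunable reconstruction-versus-coherence knob the abstract describes.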

Figures (3)

  • Figure 1: Contrastive pairing structures for the three spatio-temporal SAE variants. (a) Temporal: pairs the same spatial position across consecutive frames. (b) Separate: applies independent temporal (red, cross-frame) and spatial (blue, within-frame) contrastive losses. (c) Raster: serializes patches row-by-row, then across frames; at frame boundaries (orange dashed), the single contrastive loss automatically creates temporal pairs (see the code sketch after this list).
  • Figure 2: SAE features on DINOv2/SSv2. (a) A scene feature activating on flat surfaces across diverse actions (holding, pushing, pouring, dropping). (b) An action-correlated feature firing on "Pushing [something] off the table," with spatial activation concentrated on the pushing hand. (c, d) Temporal comparison: standard SAE features flicker across frames (top row), while the Matryoshka SAE maintains consistent activation (bottom rows). SAE features decompose into three categories: scene features (most common, activate across actions), action-correlated features (spatially specific), and object features (respond to visual patterns regardless of action).
  • Figure 3: Qualitative text-video retrieval comparison (DINOv2/SSv2, Standard SAE). For each query, we show the top-5 retrieved clips using raw features (left) vs. SAE features (right). Green borders indicate an exact template match (a conservative criterion: visually similar clips with different templates count as incorrect). SAE features retrieve 11/20 correct matches vs. 4/20 for raw features.
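
The pairing logic in Figure 1 is compact enough to sketch as index construction. Below is a hypothetical version for a clip of T frames of H×W patches; the function names and the within-frame neighbor choice in `spatial_pairs` are assumptions for illustration, not the paper's code.

```python
# Hypothetical index construction of Figure 1's three pairing schemes.
import torch

def temporal_pairs(T: int, H: int, W: int):
    # (a) Temporal: same (row, col) patch in consecutive frames.
    idx = torch.arange(T * H * W).view(T, H * W)
    return idx[:-1].reshape(-1), idx[1:].reshape(-1)

def spatial_pairs(T: int, H: int, W: int):
    # (b) Separate, spatial half: horizontally adjacent patches within a
    # frame (one simple neighbor choice); the temporal half reuses (a).
    idx = torch.arange(T * H * W).view(T, H, W)
    return idx[:, :, :-1].reshape(-1), idx[:, :, 1:].reshape(-1)

def raster_pairs(T: int, H: int, W: int):
    # (c) Raster: serialize patches row-by-row, then frame-by-frame, and
    # pair every token with its successor. At a frame boundary the
    # successor is the first patch of the next frame, so temporal pairs
    # arise automatically from the single loss.
    idx = torch.arange(T * H * W)
    return idx[:-1], idx[1:]
```

For example, with T=2 frames of 2×2 patches, `raster_pairs` pairs token 3 (the last patch of frame 0) with token 4 (the first patch of frame 1): exactly the orange frame-boundary pairs in panel (c).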