Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions

Thinesh Thiyakesan Ponbagavathi, Alina Roitberg

TL;DR

This work addresses the challenge of recognizing nearly symmetric actions in video by revealing that standard image-to-video probes are inherently permutation-invariant and fail to leverage temporal order. It introduces STEP, a parameter-efficient probing approach that injects temporal sensitivity via learnable frame-wise positional encodings, a single global CLS token, and a simplified attention mechanism. STEP consistently outperforms probing baselines and PEFT methods across four datasets, achieving state-of-the-art results on IKEA-ASM and Drive&Act, and demonstrates strong data efficiency, including low-data SSv2 scenarios. The findings highlight the importance of explicit temporal modeling in light-weight probes and offer a practical, parameter-efficient path for transferring image foundation models to video action recognition.

Abstract

We study parameter-efficient image-to-video probing for the unaddressed challenge of recognizing nearly symmetric actions: visually similar actions that unfold in opposite temporal order (e.g., opening vs. closing a bottle). Existing probing mechanisms for image-pretrained models, such as DinoV2 and CLIP, rely on attention mechanisms for temporal modeling but are inherently permutation-invariant, leading to identical predictions regardless of frame order. To address this, we introduce Self-attentive Temporal Embedding Probing (STEP), a simple yet effective approach designed to enforce temporal sensitivity in parameter-efficient image-to-video transfer. STEP enhances self-attentive probing with three key modifications: (1) a learnable frame-wise positional encoding that explicitly encodes temporal order; (2) a single global CLS token for sequence coherence; and (3) a simplified attention mechanism that improves parameter efficiency. STEP outperforms existing image-to-video probing mechanisms by 3-15% across four activity recognition benchmarks with only 1/3 of the learnable parameters. On two datasets, it surpasses all published methods, including fully fine-tuned models. STEP shows a distinct advantage in recognizing nearly symmetric actions, surpassing other probing mechanisms by 9-19% and parameter-heavier PEFT-based transfer methods by 5-15%. Code and models will be made publicly available.
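The permutation-invariance claim can be checked on toy data. The following is a minimal sketch, not the authors' code: it assumes a single-query attention pooling with no key/value projections, random vectors standing in for frozen frame features, and an additive per-frame positional vector as the assumed form of the temporal encoding. The names `attentive_pool`, `close`, and `pos` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(frames, query, pos=None):
    """Pool T frame features with a single query vector (cross-attention).

    frames: list of T feature vectors (lists of D floats).
    query:  one D-dimensional vector standing in for the learnable query.
    pos:    optional list of T vectors, added by frame index (assumed form).
    Keys and values are the features themselves; projections are omitted.
    """
    if pos is not None:
        frames = [[f + p for f, p in zip(frame, enc)]
                  for frame, enc in zip(frames, pos)]
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, frame)) / scale
              for frame in frames]
    weights = softmax(scores)
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(len(query))]


def close(a, b, tol=1e-9):
    """Element-wise comparison that tolerates float summation-order noise."""
    return all(math.isclose(x, y, rel_tol=tol, abs_tol=tol)
               for x, y in zip(a, b))


rng = random.Random(0)
T, D = 8, 16  # toy sizes: 8 frames, 16-dimensional features
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
query = [rng.gauss(0, 1) for _ in range(D)]
pos = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
reversed_frames = frames[::-1]  # same frames, opposite temporal order

# Without positional vectors, reversal only reorders the terms of a sum.
invariant_without_pos = close(attentive_pool(frames, query),
                              attentive_pool(reversed_frames, query))
# With positional vectors, each frame is tagged by the slot it occupies.
invariant_with_pos = close(attentive_pool(frames, query, pos),
                           attentive_pool(reversed_frames, query, pos))
print(invariant_without_pos, invariant_with_pos)
```

Without `pos`, the attention weights and the values are permuted together, so the weighted sum is unchanged and a classifier on top cannot separate an action from its reverse. Adding a vector that depends on the frame index changes the tokens themselves when the order changes, which is the property a frame-wise positional encoding is meant to supply.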

Paper Structure

This paper contains 18 sections, 5 equations, 8 figures, 22 tables.

Figures (8)

  • Figure 1: We show that attentive probing, commonly used for parameter-efficient image-to-video transfer in activity recognition, is invariant to the temporal order of frames. Disrupting the frame order at test time does not affect the outcome (left bar chart), making it hard to distinguish nearly symmetric actions, i.e., actions that are visually similar but differ in the sequence of events (e.g., picking up vs. placing an object). We propose Self-attentive Temporal Embedding Probing (STEP), which makes self-attentive probing sensitive to changes in frame order (right bar chart), leading to better recognition of fine-grained and nearly symmetric actions with fewer learnable parameters.
  • Figure 2: Overview of Self-attentive Temporal Embedding Probing. Each video frame is first processed independently by a frozen image model. We replace the frame-specific CLS token with learned patch-wise temporal encodings, while a newly added frame-global CLS token encourages temporal consistency in predictions. A self-attention probing mechanism follows, which keeps track of the temporal order through these modifications.
  • Figure 3: Class-wise accuracy comparison for nearly symmetric actions in the IKEA-ASM dataset. STEP outperforms PEFT methods and probing baselines, excelling in fine-grained action recognition.
  • Figure 4: Overview of probing mechanisms for video action recognition. (Top left) Linear probing uses mean pooling over frame-wise CLS tokens. (Top right) Self-attention probing incorporates frame-wise embeddings with self-attention and mean pooling. (Bottom left) Attentive probing leverages cross-attention with a learnable query. (Bottom right) STEP integrates frame-wise patch tokens, temporal embeddings, and a global CLS token with self-attention for temporal modeling.
  • Figure 5: Architectural comparison of the standard transformer block, the no-FF variant, and the proposed simplified attention block used in our STEP framework, highlighting the removal of non-linear components.
  • ...and 3 more figures
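
The Figure 4 caption contrasts linear probing with the STEP design. The sketch below illustrates that contrast on toy data and should not be read as the authors' implementation. It assumes a single attention layer with no query/key/value projections and no feed-forward part, a temporal embedding shared by all patch tokens of a frame, and a bias-free linear head. Because only the CLS output is read, only the CLS row of the self-attention is computed. The names `linear_probe` and `step_like_probe` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_probe(frame_cls, classifier):
    """Mean-pool the frame-wise CLS tokens, then apply a linear head."""
    dim = len(frame_cls[0])
    pooled = [sum(tok[d] for tok in frame_cls) / len(frame_cls)
              for d in range(dim)]
    return [dot(w, pooled) for w in classifier]


def step_like_probe(video, temporal_emb, cls_token, classifier):
    """Assumed STEP-style read-out over patch tokens.

    video:        T frames, each a list of P patch tokens (D floats each).
    temporal_emb: T vectors, one per frame index (learnable in practice).
    cls_token:    one global CLS vector (learnable in practice).
    classifier:   C weight vectors forming a bias-free linear head.
    """
    tokens = [cls_token]
    for t, frame in enumerate(video):
        for patch in frame:
            # Every patch of frame t receives the same temporal embedding.
            tokens.append([x + e for x, e in zip(patch, temporal_emb[t])])
    # CLS row of one self-attention layer: CLS attends to all tokens.
    scale = math.sqrt(len(cls_token))
    weights = softmax([dot(cls_token, tok) / scale for tok in tokens])
    cls_out = [sum(w * tok[d] for w, tok in zip(weights, tokens))
               for d in range(len(cls_token))]
    return [dot(w, cls_out) for w in classifier]


def close(a, b, tol=1e-9):
    """Element-wise comparison that tolerates float summation-order noise."""
    return all(math.isclose(x, y, rel_tol=tol, abs_tol=tol)
               for x, y in zip(a, b))


rng = random.Random(0)
T, P, D, C = 4, 3, 8, 5  # toy sizes: frames, patches, feature dim, classes


def vec():
    return [rng.gauss(0, 1) for _ in range(D)]


video = [[vec() for _ in range(P)] for _ in range(T)]
frame_cls = [vec() for _ in range(T)]
temporal_emb = [vec() for _ in range(T)]
cls_token = vec()
classifier = [vec() for _ in range(C)]

linear_invariant = close(linear_probe(frame_cls, classifier),
                         linear_probe(frame_cls[::-1], classifier))
logits = step_like_probe(video, temporal_emb, cls_token, classifier)
logits_reversed = step_like_probe(video[::-1], temporal_emb, cls_token,
                                  classifier)
step_order_sensitive = not close(logits, logits_reversed)
print(linear_invariant, step_order_sensitive)
```

Mean pooling discards frame order by construction, whereas in this sketch the temporal embedding ties each patch token to the frame slot it occupies, so reversing the video changes the logits. In a trained probe the embeddings, CLS token, and head would be learned while the image backbone stays frozen.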