Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions

Thinesh Thiyakesan Ponbagavathi; Alina Roitberg

TL;DR

This work addresses the challenge of recognizing nearly symmetric actions in video. It shows that standard image-to-video probes are inherently permutation-invariant and therefore cannot exploit temporal order. It introduces STEP, a parameter-efficient probing approach that injects temporal sensitivity via learnable frame-wise positional encodings, a single global CLS token, and a simplified attention mechanism. STEP consistently outperforms probing baselines and PEFT methods across four datasets, achieves state-of-the-art results on IKEA-ASM and Drive&Act, and is data-efficient, including in low-data SSv2 scenarios. The findings highlight the importance of explicit temporal modeling in lightweight probes and offer a practical, parameter-efficient path for transferring image foundation models to video action recognition.

Abstract

We study parameter-efficient image-to-video probing for the previously unaddressed challenge of recognizing nearly symmetric actions - visually similar actions that unfold in opposite temporal order (e.g., opening vs. closing a bottle). Existing probing mechanisms for image-pretrained models, such as DinoV2 and CLIP, rely on attention mechanisms for temporal modeling but are inherently permutation-invariant, leading to identical predictions regardless of frame order. To address this, we introduce Self-attentive Temporal Embedding Probing (STEP), a simple yet effective approach designed to enforce temporal sensitivity in parameter-efficient image-to-video transfer. STEP enhances self-attentive probing with three key modifications: (1) a learnable frame-wise positional encoding that explicitly encodes temporal order; (2) a single global CLS token for sequence coherence; and (3) a simplified attention mechanism that improves parameter efficiency. STEP outperforms existing image-to-video probing mechanisms by 3-15% across four activity recognition benchmarks with only 1/3 of the learnable parameters. On two datasets, it surpasses all published methods, including fully fine-tuned models. STEP shows a distinct advantage in recognizing nearly symmetric actions, surpassing other probing mechanisms by 9-19% and parameter-heavier PEFT-based transfer methods by 5-15%. Code and models will be made publicly available.
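
The permutation-invariance claim can be illustrated with a small, self-contained sketch. This is not the authors' STEP implementation. It assumes a single-query attention pooling probe over per-frame feature vectors, with the query standing in for a CLS token, and it uses random vectors in place of both the frozen backbone features and the positional encodings that would be learned in practice. All names (`attentive_pool`, `max_abs_diff`, `pos`) are hypothetical.

```python
import math
import random


def attentive_pool(frames, query, pos=None):
    """Single-query attention pooling over per-frame feature vectors.

    frames: list of T feature vectors (lists of floats), one per frame.
    query:  probe query vector (stands in for a CLS token).
    pos:    optional list of T positional vectors; pos[t] is added to the
            frame at temporal index t before attention is computed.
    """
    if pos is not None:
        frames = [[x + p for x, p in zip(f, pe)] for f, pe in zip(frames, pos)]
    dim = len(query)
    # Scaled dot-product score between the query and every frame.
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(dim) for f in frames]
    # Numerically stable softmax over the T frames.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the frame vectors.
    return [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
T, D = 8, 16  # illustrative sizes: 8 frames, 16-dimensional features
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
query = [rng.gauss(0, 1) for _ in range(D)]
# Assumption: random stand-ins for frame-wise positional encodings.
pos = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]

reversed_frames = frames[::-1]  # the same clip played backwards

# Without positional encodings, reversing the clip leaves the pooled
# feature unchanged up to floating-point rounding.
diff_without_pos = max_abs_diff(
    attentive_pool(frames, query),
    attentive_pool(reversed_frames, query),
)

# With frame-wise positional encodings, the pooled feature depends on order.
diff_with_pos = max_abs_diff(
    attentive_pool(frames, query, pos),
    attentive_pool(reversed_frames, query, pos),
)

print("difference without positional encoding:", diff_without_pos)
print("difference with positional encoding:", diff_with_pos)
```

Without positional encodings, a clip and its reversal yield the same pooled feature, so a classifier on top cannot tell opening from closing. Adding a per-frame positional vector makes the pooled feature order-dependent, which is the property the first STEP modification targets.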
