Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

Shristi Das Biswas; Efstathia Soufleri; Arani Roy; Kaushik Roy

TL;DR

This work addresses the decoding bottleneck and temporal redundancy in video action recognition by operating directly in the compressed domain, leveraging I-frames, motion vectors, and residuals. It introduces a hybrid end-to-end architecture with a dual-encoder plus Spiking Temporal Modulator, a unified transformer, and a Multi-Modal Mixer to capture cross-modal spatiotemporal context while dramatically reducing inference cost and energy consumption. By accumulating P-frames, employing a biologically inspired STM with dynamic thresholds, and fusing modalities through a unified context, the approach achieves state-of-the-art or competitive accuracy on five benchmarks with substantial efficiency gains. The design provides practical guidance for efficient next-generation spatiotemporal learners suitable for edge deployment and energy-constrained settings.
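To make the idea of a spiking unit with a dynamic threshold concrete, here is a minimal sketch of a generic leaky integrate-and-fire neuron whose firing threshold rises after each spike. It is not the paper's Spiking Temporal Modulator: the function name `adaptive_lif` and all parameters (`decay`, `base_threshold`, `threshold_gain`, `threshold_decay`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def adaptive_lif(inputs, decay=0.9, base_threshold=1.0,
                 threshold_gain=0.5, threshold_decay=0.8):
    """Generic leaky integrate-and-fire with an adaptive threshold.

    Illustrative only; every parameter here is an assumption, not a
    value taken from the paper.

    inputs: array of shape (T, N), one input current per time step and unit.
    Returns a (T, N) array of 0/1 spikes.
    """
    inputs = np.asarray(inputs, dtype=np.float64)
    potential = np.zeros(inputs.shape[1:])   # membrane potential per unit
    adaptation = np.zeros(inputs.shape[1:])  # threshold offset per unit
    spikes = np.zeros_like(inputs)

    for t in range(inputs.shape[0]):
        # Leak the previous potential, then integrate the new input.
        potential = decay * potential + inputs[t]
        # The threshold is dynamic: base value plus a spike-driven offset.
        threshold = base_threshold + adaptation
        fired = potential >= threshold
        spikes[t] = fired
        # Hard reset for units that fired.
        potential = np.where(fired, 0.0, potential)
        # The offset decays over time and grows with each spike.
        adaptation = threshold_decay * adaptation + threshold_gain * fired
    return spikes


# A constant input drives fewer spikes once the threshold adapts.
drive = np.full((5, 1), 2.0)
adaptive = adaptive_lif(drive)
fixed = adaptive_lif(drive, threshold_gain=0.0)
```

With a constant input, the adaptive variant fires less often than the fixed-threshold one, which is the sparsity-inducing behavior that a dynamic threshold is usually meant to provide.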

Abstract

Training robust deep video representations is computationally challenging because of substantial decoding overheads, the enormous size of raw video streams, and their inherently high temporal redundancy. Unlike existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames and P-frames (motion vectors and residuals), offers a compute-efficient alternative. Existing methods treat this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames that can be used to model stronger shared representations for videos of the same action, which makes training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of $56$ while retaining similar performance. To this end, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art. First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators minimizes latency while retaining cross-domain feature aggregation. Second, a unified transformer model captures inter-modal dependencies using global self-attention to enhance contextual interactions between I-frames and P-frames. Third, a Multi-Modal Mixer Block models rich representations from the joint spatiotemporal token embeddings. Experiments show that our method yields a lightweight architecture that achieves state-of-the-art video recognition performance on the UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs ($0.73$ J/V) and fast inference ($16$ V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.
