Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

Shristi Das Biswas, Efstathia Soufleri, Arani Roy, Kaushik Roy

TL;DR

This work addresses the decoding bottleneck and temporal redundancy in video action recognition by operating directly in the compressed domain, leveraging I-frames, motion vectors, and residuals. It introduces a hybrid end-to-end architecture with a dual-encoder plus Spiking Temporal Modulator, a unified transformer, and a Multi-Modal Mixer to capture cross-modal spatiotemporal context while dramatically reducing inference cost and energy consumption. By accumulating P-frames, employing a biologically inspired STM with dynamic thresholds, and fusing modalities through a unified context, the approach achieves state-of-the-art or competitive accuracy on five benchmarks with substantial efficiency gains. The design provides practical guidance for efficient next-generation spatiotemporal learners suitable for edge deployment and energy-constrained settings.

Abstract

Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherently high temporal redundancy. In contrast to existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames and P-frames (motion vectors and residuals), offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames, which can be used to model stronger shared representations for videos of the same action and thereby make training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of $56$ while retaining similar performance. To this end, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art. First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators minimizes latency while retaining cross-domain feature aggregation. Second, a unified transformer model captures inter-modal dependencies using global self-attention to enhance I-frame and P-frame contextual interactions. Third, a Multi-Modal Mixer Block models rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on the UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs ($0.73$ J/V) and fast inference ($16$ V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.
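The TL;DR mentions accumulating P-frames before they are encoded. As a rough illustration only, the sketch below back-traces every pixel of each P-frame to the preceding I-frame, so that motion vectors and residuals describe the change relative to that I-frame rather than to the previous frame. The function name `accumulate_p_frames`, the integer per-pixel motion layout, and the border clipping are assumptions made for this example; the summary does not specify the authors' exact procedure.

```python
import numpy as np

def accumulate_p_frames(motion_vectors, residuals):
    """Illustrative back-tracing accumulation (not the authors' exact method).

    motion_vectors: int array (T, H, W, 2); mv[t, y, x] = (dy, dx) means pixel
        (y, x) of P-frame t is predicted from (y - dy, x - dx) in frame t - 1.
    residuals: float array (T, H, W) of per-pixel prediction residuals.
    Returns accumulated motion (T, H, W, 2) and accumulated residuals (T, H, W),
    both expressed relative to the I-frame that precedes the P-frames.
    """
    T, H, W, _ = motion_vectors.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    acc_mv = np.zeros_like(motion_vectors)
    acc_res = np.zeros_like(residuals)

    # For the I-frame itself, every pixel traces back to its own location.
    prev_ref_y, prev_ref_x = ys.copy(), xs.copy()
    prev_acc_res = np.zeros((H, W), dtype=residuals.dtype)

    for t in range(T):
        # Location in frame t - 1 that each pixel of frame t refers to
        # (clipped at the border, an assumption of this sketch).
        src_y = np.clip(ys - motion_vectors[t, ..., 0], 0, H - 1)
        src_x = np.clip(xs - motion_vectors[t, ..., 1], 0, W - 1)

        # Compose with the trace of frame t - 1 to reach the I-frame.
        ref_y = prev_ref_y[src_y, src_x]
        ref_x = prev_ref_x[src_y, src_x]

        acc_mv[t, ..., 0] = ys - ref_y
        acc_mv[t, ..., 1] = xs - ref_x
        # Residuals add up along the traced path.
        acc_res[t] = prev_acc_res[src_y, src_x] + residuals[t]

        prev_ref_y, prev_ref_x, prev_acc_res = ref_y, ref_x, acc_res[t]

    return acc_mv, acc_res
```

With a constant downward motion of one pixel per frame, the accumulated motion at an interior pixel grows by one per P-frame, which is the longer-term signal that single-frame motion vectors lack.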

Paper Structure

This paper contains 13 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: We visualize two "Jumping Jack" videos in t-SNE space (van der Maaten and Hinton, 2008) using their RGB and compressed-domain (CD) representations (MV and R) to investigate the learning benefits of CD space. The curves show the video trajectories. While the two videos are clearly separated in joint RGB space, they overlap in MV and R space (row 3). This suggests that an RGB-image-based model needs to learn the two patterns separately, while a CD-based model sees a shared representation for videos of the same action, benefiting training and generalization. Also note that the forward and return paths of the RGB trajectories overlap, showing that RGB representations cannot distinguish between up- and down-moving motion. In contrast, CD signals preserve motion direction, so the trajectories form circles instead of going back and forth along the same path (row 1, row 2).
  • Figure 2: Original motion vectors and residuals capture only inter-frame changes, often with low signal-to-noise ratios, making them difficult to model. In contrast, accumulated P-frames aggregate longer-term differences, revealing clearer temporal patterns.
  • Figure 3: Overview of the unrolled dynamics of our Spiking Temporal Modulator Unit. (a) MV and Residual inputs across time are (b) aggregated and modulated by synaptic weights to be integrated as current influx in the membrane potential for each STM. The firing threshold $v^l_{th}$ and the leak factor $\lambda^l$ are dynamically updated during training to attain best possible performance.
  • Figure 4: Overview of our method for unified representation learning in compressed videos. Given a set of compressed inputs, our factorized encoder modules sequentially aggregate rich spatio-temporal embeddings. These then interact to capture both local and global dependencies across the modalities and are fused using a specially designed Multi-Modal Mixer Block at different levels of granularity.
  • Figure 5: Visualization of the significant features extracted by our method, mapped back to the input space, across multiple action classes for the UCF-101 dataset (left) and the K-400 dataset (right). Our proposed model learns to focus on the relevant parts of the video for classification across modalities.
  • ...and 1 more figure
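
The Figure 3 caption describes inputs that are weighted, integrated as current into a membrane potential, and compared against a firing threshold with a leak factor. A minimal sketch of such leaky integrate-and-fire dynamics is given below. The function name `lif_forward`, the soft reset after a spike, and the fixed scalar `v_th` and `leak` are assumptions for illustration; in the paper these two quantities are updated during training, which would typically require a surrogate gradient for the non-differentiable spike and is omitted here.

```python
import numpy as np

def lif_forward(inputs, weights, v_th=1.0, leak=0.9):
    """Leaky integrate-and-fire layer unrolled over time (illustrative sketch).

    inputs:  array (T, N_in), e.g. flattened MV or residual features per step.
    weights: array (N_in, N_out), the synaptic weights.
    v_th:    firing threshold (fixed here; learned in the paper).
    leak:    leak factor applied to the membrane potential each step.
    Returns a (T, N_out) array of 0/1 spikes.
    """
    T = inputs.shape[0]
    n_out = weights.shape[1]
    v = np.zeros(n_out)                  # membrane potential
    spikes = np.zeros((T, n_out))

    for t in range(T):
        current = inputs[t] @ weights    # weighted input current
        v = leak * v + current           # leaky integration
        fired = v >= v_th                # threshold comparison
        spikes[t] = fired
        # Soft reset (assumption): subtract the threshold where a spike fired.
        v = np.where(fired, v - v_th, v)

    return spikes
```

Because the output is non-zero only at steps where the potential crosses the threshold, downstream computation can stay sparse, which is consistent with the energy argument made in the TL;DR.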