Advancing Compressed Video Action Recognition through Progressive Knowledge Distillation

Efstathia Soufleri, Deepak Ravikumar, Kaushik Roy

TL;DR

Compressed video action recognition exploits the three modalities already present in a compressed stream: motion vectors (MV), residuals (R), and intra-frames (I-frames). The paper introduces Progressive Knowledge Distillation (PKD), which transfers knowledge across modalities into Internal Classifiers (ICs), guided by the observed hierarchy of flatter minima, and Weighted Inference with Scaled Ensemble (WISE), which improves inference accuracy by ensembling IC outputs with learned weights. Compared with cross-entropy training, PKD improves IC accuracy by up to 5.87% on UCF-101 and 11.42% on HMDB-51, and WISE adds up to 4.28% and 9.30% on the same datasets, while offering favorable compute/latency trade-offs against state-of-the-art compressed-domain methods. Together, PKD and WISE form a framework for action recognition in compressed videos that scales in both accuracy and computational cost.

Abstract

Compressed video action recognition classifies video samples by leveraging the different modalities in compressed videos, namely motion vectors, residuals, and intra-frames. For this purpose, three neural networks are deployed, each dedicated to processing one modality. Our observations indicate that the network processing intra-frames tends to converge to a flatter minimum than the network processing residuals, which in turn converges to a flatter minimum than the motion vector network. This hierarchy in convergence motivates our strategy for knowledge transfer among modalities to achieve flatter minima, which are generally associated with better generalization. With this insight, we propose Progressive Knowledge Distillation (PKD), a technique that incrementally transfers knowledge across the modalities. This method involves attaching early exits (Internal Classifiers, ICs) to the three networks. PKD distills knowledge starting from the motion vector network, followed by the residual network, and finally the intra-frame network, sequentially improving IC accuracy. Further, we propose Weighted Inference with Scaled Ensemble (WISE), which combines outputs from the ICs using learned weights, boosting accuracy during inference. Our experiments demonstrate the effectiveness of training the ICs with PKD compared to standard cross-entropy-based training, showing IC accuracy improvements of up to 5.87% and 11.42% on the UCF-101 and HMDB-51 datasets, respectively. Additionally, WISE improves accuracy by up to 4.28% and 9.30% on UCF-101 and HMDB-51, respectively.
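
The progressive order described in the abstract (motion vector, then residual, then intra-frame teacher) can be illustrated with a small sketch. It assumes the three-stage epoch schedule given in the Figure 3 caption (epochs 0 to K, K+1 to T, T+1 to M) and swaps in the standard temperature-scaled KL distillation loss, which may differ from the paper's own equation. The function names `teacher_for_epoch` and `kd_loss`, and the default temperature, are hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_for_epoch(epoch, K, T, M):
    """Return which frozen backbone's final classifier teaches the IC.

    Assumed schedule (from the Figure 3 caption):
    epochs 0..K -> MV, K+1..T -> R, T+1..M -> I-frame.
    """
    if not (0 <= K < T < M):
        raise ValueError("expected 0 <= K < T < M")
    if epoch < 0 or epoch > M:
        raise ValueError("epoch outside the schedule")
    if epoch <= K:
        return "MV"
    if epoch <= T:
        return "R"
    return "I"


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KD loss: T^2 * KL(teacher || student) on softened outputs.

    This is the common Hinton-style formulation, used here as an
    assumption; it is not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)  # teacher (frozen backbone FC)
    q = softmax(student_logits, temperature)  # student (internal classifier)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Example: with K=2, T=5, M=8 the teacher changes at epochs 3 and 6.
schedule = [teacher_for_epoch(e, K=2, T=5, M=8) for e in range(9)]
```

In an actual training loop, only the IC parameters would receive gradients from this loss, since the backbones stay frozen during PKD.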

Paper Structure (12 sections, 1 equation, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Backbone networks with ICs for each video modality.
  • Figure 2: Visualizing flatter minima resulting from PKD and the associated loss curvature.
  • Figure 3: PKD IC training overview: 1) from epoch 0 to K, we perform KD between the IC and the final classifier (FC) of the MV backbone network; 2) from epoch K+1 to T, we perform KD between the IC block and the FC of the R backbone network; 3) from epoch T+1 to M, we perform KD between the IC block and the FC of the I-frame backbone network. Note that the parameters of the MV, R, and I-frame backbone networks (illustrated in yellow) are not updated during PKD. The ICs (illustrated in green) are trained independently.
  • Figure 4: WISE Inference Overview: the video sample is evaluated sequentially. The exits might be from different compressed video modality backbones. The previous IC predictions are combined with scaling factors $\beta$ into an ensemble. If the confidence of the prediction exceeds a certain threshold $\tau$, classification terminates. Otherwise, the next IC is evaluated.
  • Figure 5: Trade-off curves comparing our proposal with SOTA work on UCF-101 and HMDB-51. SOTA works (marked in red) fall either in the low-accuracy, low-computation-cost regime (lower left of the plot) or in the high-accuracy, high-computation-cost regime (top right of the plot). Our proposal, which uses two backbones, i.e., CoViAR and TEAM-NET (marked in green), scales in both accuracy and computational cost.
  • ...and 4 more figures
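
The WISE procedure summarized in the Figure 4 caption can be sketched as a short early-exit loop. The caption does not give the exact combination rule, so this sketch assumes a weighted average of per-exit class probabilities with positive scaling factors β, and takes the confidence to be the largest ensemble probability. The function name `wise_inference` and the example numbers are hypothetical.

```python
def wise_inference(exit_probs, betas, tau):
    """Evaluate exits in order and stop once the ensemble is confident.

    exit_probs: iterable of per-exit class-probability lists, in the
                order the ICs are evaluated (may be a lazy generator,
                so later exits are never computed after an early stop).
    betas:      positive scaling factor for each exit (assumed learned).
    tau:        confidence threshold for terminating early.
    Returns (predicted_class_index, number_of_exits_evaluated).
    """
    running = None      # beta-weighted sum of probabilities so far
    weight_sum = 0.0    # sum of betas used so far
    ensemble = None
    used = 0
    for probs, beta in zip(exit_probs, betas):
        if beta <= 0:
            raise ValueError("this sketch assumes positive scaling factors")
        used += 1
        if running is None:
            running = [0.0] * len(probs)
        running = [r + beta * p for r, p in zip(running, probs)]
        weight_sum += beta
        ensemble = [r / weight_sum for r in running]
        if max(ensemble) > tau:
            break  # confident enough: terminate classification here
    if ensemble is None:
        raise ValueError("no exits were provided")
    predicted = max(range(len(ensemble)), key=ensemble.__getitem__)
    return predicted, used


# Three exits, possibly from different modality backbones (made-up values).
example_probs = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.0, 1.0, 0.0]]
example_betas = [1.0, 2.0, 1.0]
result = wise_inference(example_probs, example_betas, tau=0.6)
```

Raising τ trades computation for accuracy: a higher threshold forces more exits to be evaluated before a prediction is returned.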