MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition

Binhua Huang, Wendong Yao, Shaowu Chen, Guoxin Wang, Qingyuan Wang, Soumyabrata Dev

TL;DR

MoCrop presents a training-free, motion-guided adaptive cropping pipeline that leverages Motion Vectors from compressed video to focus inference on motion-salient regions. The method comprises Merge & Denoise, Monte Carlo Sampling, and Motion Grid Search to identify a bounding region that is cropped consistently across I-frames, functioning as a lightweight preprocessor that can either reduce input resolution or improve accuracy without retraining backbones. Empirical results on UCF101 show substantial gains in accuracy or reductions in FLOPs across CNNs and Transformers, and the approach also enhances prior compressed-domain methods like CoViAR, demonstrating broad applicability and real-time viability. The work highlights practical efficiency-accuracy trade-offs and outlines limitations stemming from motion signals, offering avenues for future expansion with multi-region proposals and codec-agnostic motion cues.

Abstract

Standard video action recognition models typically process resized full frames, which incurs spatial redundancy and high computational cost. To address this, we introduce MoCrop, a motion-aware adaptive cropping module designed for efficient video action recognition in the compressed domain. Leveraging Motion Vectors (MVs) naturally available in H.264 video, MoCrop localizes motion-dense regions to produce adaptive crops at inference without requiring any training or parameter updates. Our lightweight pipeline combines three key components: Merge & Denoise (MD) for outlier filtering, Monte Carlo Sampling (MCS) for efficient importance sampling, and Motion Grid Search (MGS) for optimal region localization. This design allows MoCrop to serve as a versatile "plug-and-play" module for diverse backbones. Extensive experiments on UCF101 demonstrate that MoCrop serves as both an accelerator and an enhancer. With ResNet-50, it achieves a +3.5% boost in Top-1 accuracy at equivalent FLOPs (Attention Setting), or a +2.4% accuracy gain with 26.5% fewer FLOPs (Efficiency Setting). When applied to CoViAR, it improves accuracy to 89.2% or reduces computation by roughly 27% (from 11.6 to 8.5 GFLOPs). Consistent gains across MobileNet-V3, EfficientNet-B1, and Swin-B confirm its strong generality and suitability for real-time deployment. Our code and models are available at https://github.com/microa/MoCrop.
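
The FLOPs figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the CoViAR reduction from the stated 11.6 and 8.5 GFLOPs. It also shows, purely as an illustration, how a smaller square input would translate into a FLOPs saving if cost scaled linearly with pixel count. The 224 and 192 side lengths and the linear-scaling assumption are hypothetical and are not stated in the abstract.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# CoViAR figures quoted in the abstract (GFLOPs).
coviar_reduction = relative_reduction(11.6, 8.5)

# Hypothetical illustration: assume FLOPs scale with the number of input
# pixels, and compare a 224x224 input with a 192x192 crop. These side
# lengths are assumptions made for this example only.
pixel_reduction = relative_reduction(224 * 224, 192 * 192)

print(f"CoViAR reduction: {coviar_reduction:.1%}")
print(f"Pixel-count reduction (assumed sizes): {pixel_reduction:.1%}")
```

The first value rounds to the "roughly 27%" given in the abstract. The second only shows the kind of input-size change that would yield a saving of similar magnitude under the stated assumption.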

Paper Structure

This paper contains 18 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: MoCrop pipeline. MVs identify SSR to guide I-frame cropping; cropped frames are fed to the action recognition model.
  • Figure 2: Qualitative visualization of the complete MoCrop pipeline across six representative UCF101 action classes. Each row shows one video processed through six stages: (a) Raw MVs: MVs extracted from H.264 compressed video, visualized as arrows showing magnitude and direction. (b) MD (Denoised): Top-1% MVs retained after percentile-based filtering. (c) MCS (Sampled): MVs after weighted importance sampling (10% of denoised MVs, biased toward high-magnitude motion). (d) MGS - Grid: Motion-density heatmap aggregated on a $16{\times}9$ spatial grid. (e) MGS - Search: Optimal region identified by dual-objective scoring (white box overlaid on the heatmap). (f) Comparison: Overlay comparison of Motion-Aware Crop (green box) vs. Center Crop (red box) on the same RGB frame, demonstrating MoCrop's ability to focus on action-relevant regions.
  • Figure 3: Failure mode analysis across five representative cases. (a) Motion-density map. (b) Heatmap overlay with search result (white box). (c) Motion-Aware Crop (green) vs. Center Crop (red). See text for detailed analysis.
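
The three stages named in the Figure 2 caption (MD, MCS, MGS) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The MV array layout `(x, y, dx, dy)`, all function names, and the `lam` trade-off weight are hypothetical. In particular, the scoring rule in `motion_grid_search` (captured motion fraction minus an area penalty) is only a stand-in for the "dual-objective scoring" mentioned in the caption, whose exact form is not given here. The 1% retention, 10% sampling fraction, and 16x9 grid are taken from the caption. Extracting MVs from an H.264 stream is out of scope for this sketch.

```python
import numpy as np


def merge_and_denoise(mvs, keep_percent=1.0):
    """MD: keep the top `keep_percent` percent of MVs by magnitude.

    `mvs` is a non-empty (N, 4) array of (x, y, dx, dy) rows, assumed to be
    already merged across the frames of one clip (hypothetical layout).
    """
    mag = np.hypot(mvs[:, 2], mvs[:, 3])
    thresh = np.percentile(mag, 100.0 - keep_percent)
    keep = mag >= thresh
    return mvs[keep], mag[keep]


def monte_carlo_sample(mvs, mag, frac=0.10, rng=None):
    """MCS: draw `frac` of the MVs without replacement, weighted by magnitude."""
    rng = np.random.default_rng() if rng is None else rng
    n = max(1, int(round(frac * len(mvs))))
    weights = mag + 1e-12  # small offset keeps every probability non-zero
    idx = rng.choice(len(mvs), size=n, replace=False, p=weights / weights.sum())
    return mvs[idx], mag[idx]


def motion_grid_search(mvs, mag, frame_w, frame_h, grid=(16, 9), lam=0.5):
    """MGS: accumulate motion on a grid, then pick the best-scoring window.

    Score = fraction of motion captured - lam * fraction of grid area used.
    This score is an assumed stand-in, not the scoring defined in the paper.
    """
    gx, gy = grid
    cols = np.clip((mvs[:, 0] / frame_w * gx).astype(int), 0, gx - 1)
    rows = np.clip((mvs[:, 1] / frame_h * gy).astype(int), 0, gy - 1)
    heat = np.zeros((gy, gx))
    np.add.at(heat, (rows, cols), mag)  # motion-density heatmap
    total = heat.sum()
    if total == 0:
        return 0, 0, frame_w, frame_h  # no motion: fall back to full frame

    # Integral image so that every window sum costs O(1).
    ii = np.zeros((gy + 1, gx + 1))
    ii[1:, 1:] = heat.cumsum(axis=0).cumsum(axis=1)

    best, best_score = (0, 0, gx, gy), -np.inf
    for h in range(1, gy + 1):
        for w in range(1, gx + 1):
            for r in range(gy - h + 1):
                for c in range(gx - w + 1):
                    captured = (ii[r + h, c + w] - ii[r, c + w]
                                - ii[r + h, c] + ii[r, c])
                    score = captured / total - lam * (w * h) / (gx * gy)
                    if score > best_score:
                        best, best_score = (c, r, w, h), score

    c, r, w, h = best
    cell_w, cell_h = frame_w / gx, frame_h / gy
    # Convert grid cells back to a pixel box (x0, y0, x1, y1).
    return (int(c * cell_w), int(r * cell_h),
            int(round((c + w) * cell_w)), int(round((r + h) * cell_h)))


def mocrop_box(mvs, frame_w, frame_h, rng=None):
    """Run MD -> MCS -> MGS and return one crop box for the clip."""
    kept, mag = merge_and_denoise(mvs)
    sampled, sampled_mag = monte_carlo_sample(kept, mag, rng=rng)
    return motion_grid_search(sampled, sampled_mag, frame_w, frame_h)
```

In use, the returned box would be applied to each decoded I-frame before resizing, for example `frame[y0:y1, x0:x1]` on an `(H, W, 3)` array, so that the same region is cropped consistently across the clip. The exhaustive window search visits only a few thousand candidates on a 16x9 grid, which is why a brute-force loop is adequate for this sketch.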