MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition
Binhua Huang, Wendong Yao, Shaowu Chen, Guoxin Wang, Qingyuan Wang, Soumyabrata Dev
TL;DR
MoCrop is a training-free, motion-guided adaptive cropping pipeline that uses Motion Vectors from compressed video to focus inference on motion-salient regions. It combines three steps, Merge & Denoise, Monte Carlo Sampling, and Motion Grid Search, to identify a single bounding region that is cropped consistently across I-frames. It acts as a lightweight preprocessor that can either reduce input resolution or improve accuracy without retraining the backbone. On UCF101, it yields accuracy gains or FLOPs reductions across CNNs and Transformers, and it also improves prior compressed-domain methods such as CoViAR, indicating broad applicability and real-time viability. The work discusses the resulting efficiency-accuracy trade-offs, outlines limitations that stem from relying on motion signals, and suggests future extensions with multi-region proposals and codec-agnostic motion cues.
Abstract
Standard video action recognition models typically process resized full frames, which incurs spatial redundancy and high computational cost. To address this, we introduce MoCrop, a motion-aware adaptive cropping module designed for efficient video action recognition in the compressed domain. Leveraging the Motion Vectors (MVs) already available in H.264 video, MoCrop localizes motion-dense regions to produce adaptive crops at inference time without requiring any training or parameter updates. Our lightweight pipeline combines three key components: Merge & Denoise (MD) for outlier filtering, Monte Carlo Sampling (MCS) for efficient importance sampling, and Motion Grid Search (MGS) for optimal region localization. This design allows MoCrop to serve as a versatile "plug-and-play" module for diverse backbones. Extensive experiments on UCF101 demonstrate that MoCrop works as both an accelerator and an enhancer. With ResNet-50, it achieves a +3.5% boost in Top-1 accuracy at equivalent FLOPs (Attention Setting), or a +2.4% accuracy gain with 26.5% fewer FLOPs (Efficiency Setting). When applied to CoViAR, it improves accuracy to 89.2% or reduces computation by roughly 27% (from 11.6 to 8.5 GFLOPs). Consistent gains across MobileNet-V3, EfficientNet-B1, and Swin-B confirm its generality and suitability for real-time deployment. Our code and models are available at https://github.com/microa/MoCrop.
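To make the three-stage idea concrete, the following is a minimal sketch and not the released implementation. It assumes the motion vectors have already been decoded into an array of shape (T, H, W, 2) on a block grid. The function names (motion_density, sample_counts, best_crop), the percentile clipping used as a stand-in for outlier filtering, and the exhaustive window scan are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def motion_density(mvs, clip_pct=95.0):
    """Merge per-frame MV magnitudes into one (H, W) map and clip outliers.

    Assumption: percentile clipping stands in for the paper's denoising step.
    """
    magnitude = np.linalg.norm(mvs, axis=-1)   # (T, H, W)
    merged = magnitude.sum(axis=0)             # merge over time -> (H, W)
    cap = np.percentile(merged, clip_pct)
    return np.minimum(merged, cap)

def sample_counts(density, n_samples, rng):
    """Draw grid cells with probability proportional to motion density."""
    total = density.sum()
    if total <= 0:                             # no motion: nothing to sample
        return np.zeros_like(density)
    p = (density / total).ravel()
    idx = rng.choice(density.size, size=n_samples, p=p)
    return np.bincount(idx, minlength=density.size).reshape(density.shape)

def best_crop(score, crop_h, crop_w, stride=1):
    """Return the (row, col) of the crop_h x crop_w window with the largest sum."""
    H, W = score.shape
    integral = np.zeros((H + 1, W + 1))
    integral[1:, 1:] = score.cumsum(axis=0).cumsum(axis=1)
    best_sum, best_box = -1.0, (0, 0)
    for y in range(0, H - crop_h + 1, stride):
        for x in range(0, W - crop_w + 1, stride):
            s = (integral[y + crop_h, x + crop_w] - integral[y, x + crop_w]
                 - integral[y + crop_h, x] + integral[y, x])
            if s > best_sum:
                best_sum, best_box = s, (y, x)
    return best_box

# Toy usage: motion concentrated in rows 4-6, columns 2-4 of an 8x8 grid.
mvs = np.zeros((3, 8, 8, 2))
mvs[:, 4:7, 2:5] = (3.0, 4.0)
density = motion_density(mvs)
counts = sample_counts(density, 500, np.random.default_rng(0))
top_left = best_crop(counts, crop_h=3, crop_w=3)
```

In this sketch the window found on the block grid would still need to be scaled to pixel coordinates before the same crop is applied to every I-frame; that scale factor depends on how the motion vectors were extracted and is left out here.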
