
CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning

Hieu Hoang, Dung Trung Tran, Hong Nguyen, Nam-Phong Nguyen

Abstract

Online Action Detection (OAD) systems face two primary challenges: high computational cost and insufficient modeling of discriminative temporal dynamics against background motion. Adding optical flow provides strong motion cues but incurs significant computational overhead. We propose CAKE, a flow-based distillation framework for OAD that transfers motion knowledge into RGB models. We propose a Dynamic Motion Adapter (DMA) that suppresses static background noise and emphasizes pixel changes, approximating optical flow without explicit computation. The framework also integrates a Floating Contrastive Learning strategy to distinguish informative motion dynamics from temporal background. Experiments on the TVSeries, THUMOS'14, and Kinetics-400 datasets demonstrate the effectiveness of our model. CAKE achieves a standout mAP compared with state-of-the-art (SOTA) methods while using the same backbone. Our model runs at over 72 FPS on a single CPU, making it well suited to resource-constrained systems.
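To give a feel for what "suppressing static background and emphasizing pixel changes" can mean, the sketch below uses plain frame differencing. This is only an illustration of the underlying intuition and is not the DMA described in the paper, which is a learned module; the function name `frame_difference` and the `threshold` parameter are hypothetical.

```python
def frame_difference(prev_frame, next_frame, threshold=0.05):
    """Illustrative motion cue, NOT the paper's DMA.

    Frames are 2D lists of grayscale intensities in [0, 1].
    Pixels whose absolute change is at or below `threshold` are treated
    as static background and set to 0.0; larger changes are kept.
    """
    motion = []
    for row_prev, row_next in zip(prev_frame, next_frame):
        motion_row = []
        for a, b in zip(row_prev, row_next):
            change = abs(b - a)
            motion_row.append(change if change > threshold else 0.0)
        motion.append(motion_row)
    return motion


# Two tiny 2x2 frames: only the top-right pixel changes.
prev_frame = [[0.2, 0.2], [0.2, 0.2]]
next_frame = [[0.2, 0.9], [0.2, 0.2]]
motion_map = frame_difference(prev_frame, next_frame)
```

In this toy case only the changed pixel survives in `motion_map`, while the static pixels are zeroed. A learned adapter can go well beyond a fixed threshold, which is presumably why the paper distills from an optical-flow teacher instead of relying on raw differences.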

Paper Structure (13 sections, 6 equations, 4 figures, 5 tables)

This paper contains 13 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the proposed CAKE framework and DMA. (a) During training, an optical-flow teacher guides the DMA via cross-modal distillation. At inference, the RGB backbone splits into a static branch and a DMA motion branch, whose features are fused before temporal modeling with a GRU. (b) ODConv3D mechanism used in DMA, where base kernels $W_i$ are modulated by dynamic attention weights along kernel ($a_w$), channel ($a_f, a_c$), spatial ($a_s$), and temporal ($a_t$) dimensions.
  • Figure 2: Overview of the MoCo mechanism. The query encoder is updated by backpropagation, while the momentum encoder is updated using Exponential Moving Average (EMA) to maintain a consistent dictionary queue.
  • Figure 3: Qualitative analysis using Grad-CAM. From left to right: (a) Original RGB frames, (b) Teacher trained on Optical Flow, (c) RGB Student with static Conv3D, and (d) Our DMA with ODConv3D.
  • Figure 4: t-SNE visualization of the learned feature space on THUMOS'14. (a) Training features showing clear semantic clustering produced by Floating SupCon. (b) Test features demonstrating generalization to unseen samples.
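The MoCo mechanism named in the Figure 2 caption can be sketched in a few lines: the momentum encoder's parameters follow the query encoder through an Exponential Moving Average, and encoded keys go into a fixed-size first-in, first-out queue. The following is a minimal standard-library sketch under stated assumptions, not the authors' implementation: parameters are flat lists of floats, and the names `ema_update` and `KeyQueue` and the momentum value 0.999 are illustrative choices, not taken from the paper.

```python
from collections import deque


def ema_update(momentum_params, query_params, momentum=0.999):
    """One EMA step: theta_k <- m * theta_k + (1 - m) * theta_q.

    Both arguments are flat lists of floats (an assumption made to
    keep the sketch dependency-free).
    """
    return [
        momentum * k + (1.0 - momentum) * q
        for k, q in zip(momentum_params, query_params)
    ]


class KeyQueue:
    """Fixed-size FIFO dictionary of keys; oldest entries drop first."""

    def __init__(self, max_size):
        self._keys = deque(maxlen=max_size)

    def enqueue(self, new_keys):
        # deque(maxlen=...) discards from the left when full.
        self._keys.extend(new_keys)

    def keys(self):
        return list(self._keys)


# The momentum encoder drifts slowly toward the query encoder.
updated = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)

# A queue of size 3 keeps only the three most recent keys.
queue = KeyQueue(max_size=3)
queue.enqueue(["k1", "k2"])
queue.enqueue(["k3", "k4"])
recent_keys = queue.keys()
```

A momentum close to 1 makes the momentum encoder change slowly, so keys already in the queue stay comparable with newly encoded ones. That is the "consistent dictionary queue" the caption refers to.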