Multi-model learning by sequential reading of untrimmed videos for action recognition
Kodai Kamiya, Toru Tamaki
TL;DR
The paper tackles action recognition from untrimmed videos, where clips read sequentially from the same video are correlated with one another. It proposes a training scheme in which $M$ model replicas process sequential clips in turn and are partially synchronized, in the manner of federated learning, via a momentum parameter $\alpha$; this enables end-to-end learning without precomputed clip features. Across UCF101, HMDB51, and MPII Cooking, the approach improves performance over non-synchronized baselines, with the best results for small model counts (2–3) and carefully tuned $\alpha$, while sequential sampling is more efficient than random sampling. The findings point to a practical way of recognizing actions in long videos that balances accuracy and computation, and they highlight the trade-offs in synchronization strength and model count for real-world deployment.
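As a rough illustration of what partial synchronization with a momentum parameter could look like, the sketch below pulls each replica's parameters toward the average over all replicas. The update rule, the function name `partial_sync`, and the reading of $\alpha$ as the weight kept on a replica's own parameters are assumptions made for this example; the summary does not state the exact rule used in the paper.

```python
def partial_sync(replicas, alpha):
    """Pull each replica toward the average of all replicas.

    replicas: list of parameter vectors (lists of floats), one per model.
    alpha:    ASSUMED to be the weight kept on a replica's own parameters,
              so alpha = 1.0 means no synchronization and alpha = 0.0
              means full averaging. The paper's convention may differ.
    """
    num_models = len(replicas)
    dim = len(replicas[0])

    # Element-wise mean of the parameters across replicas.
    mean = [sum(r[i] for r in replicas) / num_models for i in range(dim)]

    # Blend each replica with the mean; the inputs are left unmodified.
    return [
        [alpha * r[i] + (1.0 - alpha) * mean[i] for i in range(dim)]
        for r in replicas
    ]


if __name__ == "__main__":
    # Two toy "models" with two parameters each.
    print(partial_sync([[0.0, 2.0], [2.0, 4.0]], alpha=0.5))
```

Under this assumed rule, intermediate values of `alpha` let the replicas share information while still keeping some diversity, which is one way to read the trade-off in tuning $\alpha$ noted above.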
Abstract
We propose a new method for learning from untrimmed videos that sequentially extracts video clips and aggregates multiple models trained on them. The proposed method reduces the correlation between clips by feeding them to the models in turn, and it synchronizes the models through federated learning. Experimental results show that the proposed method improves performance compared with training without synchronization.
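Feeding clips to the models in turn can be pictured as a round-robin assignment. This is a minimal sketch under that assumption; the helper name `assign_clips_round_robin` is hypothetical, and the actual schedule in the paper may differ.

```python
def assign_clips_round_robin(num_clips, num_models):
    """Return, for each model, the indices of the sequential clips it receives.

    ASSUMPTION: clips are dealt out in strict round-robin order, so clip k
    goes to model k % num_models.
    """
    assignment = [[] for _ in range(num_models)]
    for clip_idx in range(num_clips):
        assignment[clip_idx % num_models].append(clip_idx)
    return assignment


if __name__ == "__main__":
    # Seven consecutive clips from one untrimmed video, three models.
    print(assign_clips_round_robin(7, 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```

With this assignment, two clips seen consecutively by the same model are `num_models` positions apart in the video rather than adjacent, which is the sense in which the correlation between clips is reduced.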
