Multi-model learning by sequential reading of untrimmed videos for action recognition

Kodai Kamiya; Toru Tamaki


TL;DR

The paper addresses action recognition from untrimmed videos, where clips read sequentially from the same video are correlated. It proposes a training scheme in which $M$ model replicas process sequential clips in turn and are partially synchronized through federated learning, with a momentum parameter $\alpha$ controlling the synchronization; this enables end‑to‑end learning without precomputed clip features. Across UCF101, HMDB51, and MPII Cooking, the approach improves performance over non‑synchronized baselines, with the best results for small model counts (2–3) and carefully tuned $\alpha$, while sequential sampling is more efficient than random sampling. The findings point to a practical path to long‑video action recognition that balances accuracy and computation, and they highlight trade‑offs in synchronization and model count for real‑world deployment.

Abstract

We propose a new method for training video models that sequentially extracts clips from untrimmed videos and aggregates multiple models. The proposed method reduces the correlation between clips by feeding clips to multiple models in turn, and synchronizes these models through federated learning. Experimental results show that the proposed method improves performance compared to training without synchronization.
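The two ideas in the abstract, feeding clips to the models in turn and partially synchronizing the models, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names `assign_clips_round_robin` and `partial_sync` are hypothetical, model parameters are represented as flat lists of floats, and the update rule (each model moves toward the average of all models, weighted by $\alpha$) is one plausible reading of "partially synchronized via a momentum parameter"; the paper's exact rule may differ.

```python
from typing import List


def assign_clips_round_robin(num_clips: int, num_models: int) -> List[List[int]]:
    """Give clip t to model (t mod num_models).

    Consecutive clips of a video therefore go to different models, so no
    single model sees a run of adjacent, correlated clips.
    """
    assignment: List[List[int]] = [[] for _ in range(num_models)]
    for t in range(num_clips):
        assignment[t % num_models].append(t)
    return assignment


def partial_sync(params: List[List[float]], alpha: float) -> List[List[float]]:
    """Blend each model's parameters with the average over all models.

    ASSUMED rule (not taken from the paper):
        theta_i <- alpha * theta_i + (1 - alpha) * mean(theta_1..theta_M)
    Under this assumption alpha = 1 means no synchronization and
    alpha = 0 means every model is replaced by the average.
    """
    num_models = len(params)
    mean = [sum(column) / num_models for column in zip(*params)]
    return [
        [alpha * p + (1.0 - alpha) * m for p, m in zip(model, mean)]
        for model in params
    ]


if __name__ == "__main__":
    # Five sequential clips shared between two models.
    print(assign_clips_round_robin(5, 2))
    # Two toy models with two parameters each, half-way synchronization.
    print(partial_sync([[0.0, 2.0], [2.0, 4.0]], alpha=0.5))
```

In an actual training loop, each model would take a gradient step on its assigned clip before `partial_sync` is applied; that step is omitted here because it depends on the network and optimizer.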


Paper Structure (24 sections, 14 equations, 3 figures, 3 tables)


Introduction
Related Work
Action Recognition
Effects of data shuffling
Learning with long video
Method
Data
Creating clips
Model input
Training in a normal case.
Training in our case.
Model synchronization by federated learning
Experiments
Experimental Setup
Trimmed video datasets
...and 9 more sections

Figures (3)

Figure 1:
Figure 2:
Figure 4: The proposed method for reading video clips with multiple models.
