ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos

Hyolim Kang; Jeongseok Hyun; Joungbin An; Youngjae Yu; Seon Joo Kim

TL;DR

This paper tackles Online Temporal Action Localization (On-TAL) in streaming videos with overlapping actions and no reliance on predefined action classes. It introduces ActionSwitch, a class-agnostic On-TAL framework that pairs a multi-switch finite-state machine with a state-emitting Online Action Detection (OAD) model; the model emits an action state for each incoming frame, so action instances can be generated instantly at state boundaries. A conservativeness loss penalizes unnecessary state changes, which stabilizes long action proposals and reduces fragmentation. Evaluations on THUMOS14, FineAction, Epic-Kitchens 100, and MultiTHUMOS show ActionSwitch achieving state-of-the-art results among On-TAL approaches and competitive performance against ODAS, with strong open-world potential when combined with video-language models.

Abstract

Online Temporal Action Localization (On-TAL) is a critical task that aims to instantaneously identify action instances in untrimmed streaming videos as soon as an action concludes -- a major leap from frame-based Online Action Detection (OAD). Yet, the challenge of detecting overlapping actions is often overlooked even though it is a common scenario in streaming videos. Current methods that can address concurrent actions depend heavily on class information, limiting their flexibility. This paper introduces ActionSwitch, the first class-agnostic On-TAL framework capable of detecting overlapping actions. By obviating the reliance on class information, ActionSwitch provides wider applicability to various situations, including overlapping actions of the same class or scenarios where class information is unavailable. This approach is complemented by the proposed "conservativeness loss", which directly embeds a conservative decision-making principle into the loss function for On-TAL. Our ActionSwitch achieves state-of-the-art performance in complex datasets, including Epic-Kitchens 100 targeting the challenging egocentric view and FineAction consisting of fine-grained actions.

Paper Structure (18 sections, 2 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Streaming Video Understanding
Online Temporal Action Localization
Class-agnostic Detection
Methodology
Problem Setting
State-emitting OAD Model
Conservativeness Loss
Experiments
Datasets and Features
Evaluation Metric
Implementation Details
Main Results
Ablation Studies
...and 3 more sections

Figures (4)

Figure 1: Overview of the ActionSwitch Framework: State label is derived from the sum of the ids of activated switches. For example, the state is labeled as '3' between t2 and t3 when switches '1' and '2' are simultaneously active, whereas it registers as '2' from t3 to t4 when only switch '2' is active. State changes signify action instance boundaries, and our 'conservativeness loss' minimizes state fluctuations to improve detection accuracy.
Figure 2: (a) State diagram of ActionSwitch framework. Some connections are omitted for simplicity. (b) Overall architecture of state-emitting OAD model.
Figure 3: Training process in ActionSwitch. $CE$ and $\mathcal{L}_c$ denote the terms in Eq. \ref{eq:loss_function}. GT state and conservative pseudo-state are used for training. GT states are encoded from GT action instances while the pseudo-states come from the model's own predictions. At t6, action 1 is temporarily lost and results in the fragmentation of the action instance. However, with our conservativeness loss, the output of action instances becomes robust against such fragmentation (t6) and noisy output (t11).
Figure 4: Qualitative results of On-TAL models. Overlapping action instances are drawn on separate lines. We show the ground truth (GT) and the outputs of 2-state (2S), 4-state (4S), conservative loss (Cons), SimOn \cite{simon}, and CAG-QIL \cite{cagqil}. Refer to Sec. \ref{sec:qual_comp} for an analysis of the four cases (C1$\sim$C4), which are annotated with red boxes.
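
The state encoding described in the Figure 1 caption (the state label is the sum of the ids of the active switches, and a state change marks an instance boundary) can be illustrated with a short sketch. This is not the authors' implementation: the function name `decode_states`, the frame-index timestamps, and the exclusive end index are assumptions made for illustration. It also assumes the switch ids are distinct powers of two (here 1 and 2, as in the caption), so that each sum maps back to a unique set of active switches.

```python
from typing import Dict, Iterable, List, Tuple


def decode_states(
    states: Iterable[int],
    switch_ids: Tuple[int, ...] = (1, 2),
) -> List[Tuple[int, int, int]]:
    """Turn a per-frame state sequence into (switch_id, start, end) instances.

    Hypothetical helper: assumes switch ids are distinct powers of two, so a
    state label (the sum of active ids) can be tested bit by bit.
    `end` is exclusive: the first frame at which the switch is off again.
    """
    open_since: Dict[int, int] = {}  # switch id -> frame where it turned on
    instances: List[Tuple[int, int, int]] = []
    t = -1  # stays -1 if the stream is empty
    for t, state in enumerate(states):
        for sid in switch_ids:
            active = bool(state & sid)
            if active and sid not in open_since:
                # Switch turned on: an action instance starts here.
                open_since[sid] = t
            elif not active and sid in open_since:
                # Switch turned off: the instance can be emitted immediately.
                instances.append((sid, open_since.pop(sid), t))
    # Close any instance still running when the stream ends.
    for sid, start in sorted(open_since.items()):
        instances.append((sid, start, t + 1))
    return instances


# State 1: only switch 1 is on; state 3: switches 1 and 2; state 2: only switch 2.
print(decode_states([0, 1, 3, 2, 0]))  # [(1, 1, 3), (2, 2, 4)]
```

Because an instance is appended at the frame where its switch turns off, the same loop can run on a live stream, and no class label is needed to separate the two overlapping instances.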