Information Elevation Network for Fast Online Action Detection

Sunah Min, Jinyoung Moon

TL;DR

This work designs an efficient and effective OAD network using IEUs, called an information elevation network (IEN), which outperforms state-of-the-art OAD methods using two-stream features based on RGB frames and optical flows and is the first attempt that considers the computational overhead for the practical use of OAD.

Abstract

Online action detection (OAD) is a task that receives video segments within a streaming video as inputs and identifies ongoing actions within them. It is important to retain past information associated with a current action. However, long short-term memory (LSTM), a popular recurrent unit for modeling temporal information from videos, accumulates past information from the previous hidden and cell states and the extracted visual features at each timestep without considering the relationships between the past and current information. Consequently, the forget gate of the original LSTM can lose the accumulated information relevant to the current action because it determines which information to forget without considering the current action. We introduce a novel information elevation unit (IEU) that lifts up and accumulates the past information relevant to the current action. Through ablation studies, we design an efficient and effective OAD network using IEUs, called an information elevation network (IEN). To the best of our knowledge, our IEN is the first attempt that considers the computational overhead for the practical use of OAD. Our IEN uses visual features extracted by a fast action recognition network taking only RGB frames because extracting optical flows requires heavy computational overhead. On two OAD benchmark datasets, THUMOS-14 and TVSeries, our IEN outperforms state-of-the-art OAD methods using only RGB frames. Furthermore, on the THUMOS-14 dataset, our IEN outperforms the state-of-the-art OAD methods using two-stream features based on RGB frames and optical flows.

Paper Structure

This paper contains 29 sections, 18 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison between the original LSTM and our information elevation unit (IEU). In this video segment, past information at times $-T$ and $-T+1$ is related to the current action. However, when processing the information at time $-T+1$, the LSTM considers only the past information from the previous hidden and cell states and the information at timestep $-T+1$. The LSTM therefore risks removing accumulated information relevant to the current action at the forget gate, and accumulating information at that timestep that is irrelevant to the current action at the input and output gates. The proposed IEU instead takes the current information together with the past information as inputs and adds an elevation gate to maintain and accumulate the past information relevant to the current action.
  • Figure 2: Architecture of the IEN. Taking a video segment consisting of $T+1$ chunks $V=\{c_t\}_{t=-T}^0$ as input, the IEN extracts visual features for each chunk and embeds them into an embedding vector. The embedding vector of each chunk is fed into an IEU together with the feature at $t=0$, which represents the current action. The loss $L_c$ and the probabilities are calculated for the outputs of all IEUs, and the probabilities for the K action classes and the background at the last chunk are used to determine the current action.
  • Figure 3: Structure of the information elevation unit (IEU). The IEU extends the LSTM by adding an elevation gate and taking an additional input, the current information $x_0$. The IEU’s forget gate (red box) is the same as in the original LSTM, and the input and output gates (green box) are similar except for the input $x_0$. The elevation gate (yellow box) is newly added. Merging lines denote the addition of vectors.
  • Figure 4: Three compared models for the ablation study. (a) Original LSTM w/o-$x_0$, which does not take $x_0$; (b) LSTM w/-$x_0$ in a naïve way, which takes $h_{t-1}$, $x_t$, and $x_0$ in a bundle as input; (c) LSTM w/-$x_0$ in a sophisticated way, which uses $x_0$ instead of $x_t$ or $h_{t-1}$ by considering the role of each gate.
  • Figure 5: Qualitative evaluation of IEN on THUMOS-14 b22 and TVSeries b9. The frames painted in color represent actions occurring and the graph shown below represents the predicted action probabilities.
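
The captions for Figures 2 and 3 describe a recurrent unit that receives the current-chunk feature $x_0$ at every timestep and adds an elevation gate to an LSTM. The sketch below illustrates that general idea only. The gate equations of the IEU are not given in this summary, so the gate wiring is an assumption: here the input, output, and elevation paths see $x_0$, while the forget gate does not. The names `init_params`, `ieu_like_step`, and `run_segment` are hypothetical, and the hidden size is set equal to the feature size for brevity.

```python
import math
import random

def _affine(W, b, x):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def _sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def _tanh(v):
    return [math.tanh(a) for a in v]

def init_params(dim, seed=0):
    """Small random weights; each layer maps n concatenated dim-vectors to dim units."""
    rng = random.Random(seed)
    # Number of concatenated inputs per layer (an assumption, not the paper's design).
    n_inputs = {"f": 2, "g": 2, "i": 3, "o": 3, "e": 2, "r": 2}
    params = {}
    for name, n in n_inputs.items():
        W = [[rng.uniform(-0.1, 0.1) for _ in range(n * dim)] for _ in range(dim)]
        params[name] = (W, [0.0] * dim)
    return params

def ieu_like_step(p, x_t, x_0, h_prev, c_prev):
    """One step of an LSTM-like cell with an extra, assumed elevation path."""
    # Note: "+" on Python lists is concatenation, not vector addition.
    past = h_prev + x_t        # [h_{t-1}, x_t]
    past_cur = past + x_0      # [h_{t-1}, x_t, x_0]
    feat_cur = x_t + x_0       # [x_t, x_0]

    f = _sigmoid(_affine(*p["f"], past))      # forget gate: standard LSTM inputs
    g = _tanh(_affine(*p["g"], past))         # candidate cell content
    i = _sigmoid(_affine(*p["i"], past_cur))  # input gate also sees x_0 (assumed)
    o = _sigmoid(_affine(*p["o"], past_cur))  # output gate also sees x_0 (assumed)
    e = _sigmoid(_affine(*p["e"], feat_cur))  # elevation gate (assumed form)
    r = _tanh(_affine(*p["r"], feat_cur))     # content to elevate (assumed form)

    c = [fk * ck + ik * gk + ek * rk
         for fk, ck, ik, gk, ek, rk in zip(f, c_prev, i, g, e, r)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def run_segment(p, chunks):
    """Run over chunk features ordered t = -T, ..., 0; the last chunk is x_0."""
    dim = len(chunks[0])
    h, c = [0.0] * dim, [0.0] * dim
    x_0 = chunks[-1]
    outputs = []
    for x_t in chunks:
        h, c = ieu_like_step(p, x_t, x_0, h, c)
        outputs.append(h)
    return outputs
```

Calling `run_segment(init_params(4), chunks)` on a list of five 4-dimensional feature vectors returns five hidden states, one per chunk. In a full model each hidden state would feed a classifier over the K action classes and the background, which this sketch omits.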