Density-Guided Label Smoothing for Temporal Localization of Driving Actions

Tunc Alkanat, Erkut Akdag, Egor Bondarev, Peter H. N. De With

TL;DR

This work develops a density-guided label smoothing technique, based on label probability distributions, that improves learning from boundary video-segments, which typically include multiple labels. It also designs a post-processing step that efficiently fuses information from video-segments and multiple camera views into scene-level predictions, which helps eliminate false positives.

Abstract

Temporal localization of driving actions plays a crucial role in advanced driver-assistance systems and naturalistic driving studies. However, this is a challenging task due to strict requirements for robustness, reliability and accurate localization. In this work, we focus on improving the overall performance by efficiently utilizing video action recognition networks and adapting these to the problem of action localization. To this end, we first develop a density-guided label smoothing technique based on label probability distributions to facilitate better learning from boundary video-segments that typically include multiple labels. Second, we design a post-processing step to efficiently fuse information from video-segments and multiple camera views into scene-level predictions, which facilitates elimination of false positives. Our methodology yields a competitive performance on the A2 test set of the naturalistic driving action recognition track of the 2022 NVIDIA AI City Challenge with an F1 score of 0.271.

Paper Structure (14 sections, 7 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Diagram of joint task of action classification and localization. From top to bottom: example images of the phone call, reaching behind, and pick up from floor action classes. The graph shows an ideal output, where not only classes but also the start and end times ($t_s[i]$, $t_e[i]$) of actions are visible.
  • Figure 2: Overview of the methodology. Left: multi-class, multi-view structured mini-batching ensures unbiased training. Each stream corresponds to one camera view. Middle: the network is trained using cross-entropy loss, where the target probabilities are computed individually for every segment according to the class density within that segment. Right: localization and post-processing concatenates the class probabilities of the individual streams into scene-level class probabilities, thereby fusing the streams. Then, for every class in a scene, the 1-D probability signal is analyzed for peaks. Significant peaks are considered predictions and are further refined by eliminating overlapping predictions. Note that the modules shown with a dashed border are used for training only.
  • Figure 3: Bar chart depicting the effects of the temperature parameter $\beta$ and $f(x, y_s)$ on the computed density-guided smooth labels, shown as the bar height. For this example, $N_c=3$ and $T_c=64$, and $\beta$ is set to 10, 20, and 30, from top to bottom. The number of frames for each class is given along the horizontal axis.
  • Figure 4: Experimental results of our methodology. From top left to bottom: example images of the texting with the right hand, pick up from the floor on the driver side, pick up from the floor on the passenger side, and hand on head action classes. From top right to bottom: graphs showing the predicted start and end times ($t_s(i)$, $t_e(i)$) for each class of the video clip. After obtaining the concatenated stream probabilities for each frame within the video clip, we apply the localization and post-processing steps to locate the start and end time of each class.
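
The density-guided smooth labels described in Figures 2 and 3 can be illustrated with a small sketch. The exact form of $f(x, y_s)$ is not reproduced in this section, so the code below assumes one plausible form: a temperature-scaled softmax over the number of frames each class occupies within a segment. The function name `density_guided_labels`, the example frame counts, and the reading of $N_c$ as the number of classes and $T_c$ as the segment length in frames are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def density_guided_labels(frame_labels, num_classes, beta):
    """Return a soft target distribution for one video segment.

    frame_labels: per-frame class ids inside the segment.
    num_classes:  number of classes (read here as N_c).
    beta:         temperature; larger values give smoother targets.

    ASSUMPTION: the target is a softmax over per-class frame counts
    divided by beta. The paper's own f(x, y_s) may differ.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    counts = Counter(frame_labels)
    logits = [counts.get(c, 0) / beta for c in range(num_classes)]
    # Subtract the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical boundary segment of 64 frames (T_c = 64) covering three
# classes (N_c = 3) with 40, 20, and 4 frames respectively.
segment = [0] * 40 + [1] * 20 + [2] * 4
for beta in (10, 20, 30):
    probs = density_guided_labels(segment, num_classes=3, beta=beta)
    print(beta, [round(p, 3) for p in probs])
```

Under this assumed form, the dominant class keeps the largest target probability, while increasing $\beta$ moves probability mass toward the classes that occupy fewer frames. This matches the qualitative trend that Figure 3 describes for $\beta$ set to 10, 20, and 30.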