Tracking-Assisted Object Detection with Event Cameras

Ting-Kang Yen; Igor Morawski; Shusil Dangi; Kai He; Chung-Yi Lin; Jia-Fong Yeh; Hung-Ting Su; Winston Hsu

TL;DR

This paper introduces a visibility attribute for objects and contributes an auto-labeling algorithm that both cleans an existing event camera dataset and appends visibility labels to it. It also exploits tracking strategies for pseudo-occluded objects to maintain their permanence and retain their bounding boxes.

Abstract

Event-based object detection has recently garnered attention in the computer vision community due to the exceptional properties of event cameras, such as high dynamic range and no motion blur. However, because event features are asynchronous and sparse, objects with no motion relative to the camera produce no features and become invisible, which poses a significant challenge for the task. Prior works have studied various implicitly learned memories to retain as many temporal cues as possible, but implicit memories still struggle to preserve long-term features effectively. In this paper, we treat those invisible objects as pseudo-occluded objects and aim to detect them by tracking through occlusions. First, we introduce the visibility attribute of objects and contribute an auto-labeling algorithm that not only cleans the existing event camera dataset but also appends visibility labels to it. Second, we exploit tracking strategies for pseudo-occluded objects to maintain their permanence and retain their bounding boxes, even when features have been unavailable for a very long time. These strategies can be treated as an explicitly learned memory, guided by the tracking objective, that records the displacements of objects across frames. Lastly, we propose a spatio-temporal feature aggregation module to enrich the latent features and a consistency loss to increase the robustness of the overall pipeline. We conduct comprehensive experiments to verify that our method retains still objects while discarding truly occluded ones. The results demonstrate that (1) the additional visibility labels can assist supervised training, and (2) our method outperforms state-of-the-art approaches with a significant improvement of 7.9% absolute mAP.
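
To make the idea of retaining boxes for pseudo-occluded objects concrete, the following is a minimal sketch and not the paper's implementation. It assumes detections are already associated with track identities, that each track carries a visibility score and a predicted displacement, and that a track which disappears while its visibility was low is a still object worth keeping. The names `Track`, `update_tracks`, and `still_threshold` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Box layout (cx, cy, w, h) is an assumption made for this sketch.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Track:
    box: Box
    visibility: float                  # 1.0 = moving/visible, 0.0 = still
    displacement: Tuple[float, float]  # predicted (dx, dy) to the next frame
    frames_unseen: int = 0             # frames since the last real detection


def update_tracks(
    tracks: Dict[int, Track],
    detections: Dict[int, Track],
    still_threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> Dict[int, Track]:
    """Carry still (pseudo-occluded) tracks forward; drop the others."""
    # Fresh detections always replace the stored state for their track.
    updated = dict(detections)
    for track_id, track in tracks.items():
        if track_id in updated:
            continue
        # Assumption: a track that vanished while its visibility was low
        # had stopped moving, so its box is retained and shifted by the
        # last predicted displacement (close to zero for a still object).
        if track.visibility < still_threshold:
            cx, cy, w, h = track.box
            dx, dy = track.displacement
            updated[track_id] = Track(
                box=(cx + dx, cy + dy, w, h),
                visibility=track.visibility,
                displacement=track.displacement,
                frames_unseen=track.frames_unseen + 1,
            )
        # Otherwise the object vanished while moving; this sketch treats it
        # as truly occluded and drops the track.
    return updated


previous = {
    1: Track((10.0, 10.0, 4.0, 2.0), visibility=0.0, displacement=(0.0, 0.0)),
    2: Track((30.0, 12.0, 4.0, 2.0), visibility=1.0, displacement=(1.5, 0.0)),
}
current = update_tracks(previous, detections={})
```

In this example, track 1 had already come to rest and is kept with its box unchanged, while track 2 was still moving when it disappeared and is discarded. A learned model would replace the fixed threshold with predicted visibility and displacement maps.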

Paper Structure (31 sections, 6 equations, 6 figures, 12 tables, 1 algorithm)

Introduction
Related Works
Event Representations
Object Detection with Event Cameras
Multi-Object Tracking
Tracking by Detection
Joint Object Detection and Tracking
Method
Auto-Labeling for Still Objects
Calculation of Occupancy Rate
Calculation of Displacement
Continuity Maintenance
Event Representation
Spatio-Temporal Feature Aggregation
Joint Object Detection and Tracking
...and 16 more sections

Figures (6)

Figure 1: A video sequence from the 1 Megapixel Automotive Detection Dataset with absolute timestamps, ground truth labels, and some detection results from prior works. Both cars slow down from \ref{fig:probdef1} to \ref{fig:probdef3}, remain still from \ref{fig:probdef4} to \ref{fig:probdef5}, and start moving at \ref{fig:probdef6}. Meanwhile, \ref{fig:priorwork} and \ref{fig:tednet} show the main difference between prior works and our TEDNet. Bounding boxes with different colors correspond to different categories.
Figure 2: The architecture of the proposed TEDNet. Spatio-Temporal Feature Aggregation integrates a 3D convolution with a recurrent neural network, where $H_t = \mathrm{X3D}(I_t,\ H_{t-1})$. Joint Object Detection and Tracking consists of the localization head $f_p$ (map $P_t$), size head $f_s$ (map $S_t$), offset head $f_{off}$ (map $O_t$), displacement head $f_d$ (map $D_t$), visibility head $f_v$ (map $V_t$), and consistency head $f_c$ (map $C_t$). The novel consistency map $C_t$ models the proportional relationship between $D_t$ and $V_t$, and the novel consistency loss $L_{con}$ offers better regularization to harness the advantage of X3D and increase the robustness of the overall pipeline.
Figure 3: Visualization results of prior works (RED, RVT, DMANet, HMNet, CenterTrack, PermaTrack) and our TEDNet. Bounding boxes with different colors correspond to different categories. Red circles correspond to either false positives or false negatives. TEDNet achieves state-of-the-art mAP performance by retaining bounding boxes of still objects and discarding bounding boxes of real occluded objects. (Note: * shows the post-processed feature map after the spatio-temporal feature aggregation.)
Figure 4: A video sequence of the noisy and clean ground truth with absolute timestamps and ground truth labels. One car on the left remains still from the start of the video and produces no features, so a human observer would be unaware of its physical existence. Hence, the bounding box of that car is removed by the auto-labeling algorithm described in \ref{sec:autolabel}. Bounding boxes with different colors correspond to different categories.
Figure 5: A video sequence of the clean ground truth with absolute timestamps and visibility labels. Two cars slow down, and the visibility labels are changed from 1.0 (green) to 0.0 (red). Bounding boxes with different colors correspond to different visibility (mobility).
...and 1 more figures
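
The Figure 2 caption can be read as the following set of relations. This is a hedged restatement rather than the paper's own equations: the recurrence is taken from the caption, while feeding each head the aggregated feature $H_t$ alone is an assumption, since the caption names the heads and their output maps but not their inputs. The form of $L_{con}$ is not given in this excerpt and is left unspecified.

```latex
\begin{align}
H_t &= \mathrm{X3D}(I_t,\ H_{t-1}) && \text{spatio-temporal feature aggregation} \\
P_t &= f_p(H_t), \quad S_t = f_s(H_t), \quad O_t = f_{off}(H_t) && \text{localization, size, offset} \\
D_t &= f_d(H_t), \quad V_t = f_v(H_t), \quad C_t = f_c(H_t) && \text{displacement, visibility, consistency}
\end{align}
```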
