LEOD: Label-Efficient Object Detection for Event Cameras

Ziyi Wu; Mathias Gehrig; Qing Lyu; Xudong Liu; Igor Gilitschenski

TL;DR

This work presents LEOD, the first method for label-efficient event-based detection. It unifies weakly- and semi-supervised object detection with a self-training mechanism and consistently outperforms supervised baselines across various labeling ratios.

Abstract

Object detection with event cameras benefits from the sensor's low latency and high dynamic range. However, it is costly to fully label event streams for supervised training due to their high temporal resolution. To reduce this cost, we present LEOD, the first method for label-efficient event-based detection. Our approach unifies weakly- and semi-supervised object detection with a self-training mechanism. We first utilize a detector pre-trained on limited labels to produce pseudo ground truth on unlabeled events. Then, the detector is re-trained with both real and generated labels. Leveraging the temporal consistency of events, we run bi-directional inference and apply tracking-based post-processing to enhance the quality of pseudo labels. To stabilize training against label noise, we further design a soft anchor assignment strategy. We introduce new experimental protocols to evaluate the task of label-efficient event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example, on Gen1, it improves the mAP of RVT-S trained with 1% and 2% of the labels by 8.6% and 7.8%, respectively. On 1Mpx, RVT-S trained with 10% of the labels even surpasses its fully-supervised counterpart trained with 100% of the labels. LEOD maintains its effectiveness even when all labeled data are available, reaching new state-of-the-art results. Finally, we show that our method readily scales to improve larger detectors as well. Code is released at https://github.com/Wuziyi616/LEOD
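The self-training loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the released LEOD code. `train_fn` and `predict_fn` are hypothetical placeholders for detector training and inference, and labels are assumed to be stored per frame ID. The pseudo-label refinement steps (time-flip TTA, tracking-based filtering, soft anchor assignment) are omitted. The sketch shows that real labels always take precedence, while unlabeled frames are re-annotated with fresh pseudo labels in every round.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence

Labels = Dict[Hashable, List[tuple]]  # frame ID -> list of boxes


def merge_labels(frame_ids: Sequence[Hashable], real: Labels, pseudo: Labels) -> Labels:
    """Use ground truth where it exists; fall back to pseudo labels elsewhere."""
    return {f: (real[f] if f in real else pseudo.get(f, [])) for f in frame_ids}


def self_train(
    frame_ids: Sequence[Hashable],
    real: Labels,
    train_fn: Callable[[Labels], Any],            # hypothetical: trains a detector on labels
    predict_fn: Callable[[Any, Hashable], list],  # hypothetical: (model, frame) -> boxes
    rounds: int = 1,
) -> Any:
    # Step 0: pre-train on the limited real labels only.
    model = train_fn(real)
    unlabeled = [f for f in frame_ids if f not in real]
    for _ in range(rounds):
        # Pseudo-label unlabeled frames with the current model
        # (LEOD additionally refines these with TTA and tracking; omitted here).
        pseudo = {f: predict_fn(model, f) for f in unlabeled}
        # Re-train on real + pseudo labels, then repeat with the improved model.
        model = train_fn(merge_labels(frame_ids, real, pseudo))
    return model
```

Because the pseudo labels are regenerated at the start of each round, a better detector produces better pseudo labels for the next round.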

Paper Structure (19 sections, 1 equation, 10 figures, 4 tables)

Introduction
Related Work
Method
Problem Formulation
LEOD: A Self-Training Framework
Towards High-Quality Pseudo Labeling
Experiments
Experimental Setup
Label-Efficient Results
Fully-Labeled Results
Ablation Studies
Conclusion
More Implementation Details
Tracking-based Post-Processing
RVT Training

Figures (10)

Figure 1: Detection performance of LEOD compared with baselines that are either trained only on labeled events or use naive self-training. Under the weakly-supervised setting, our method consistently improves the RVT-S detector [RVT] across all labeling ratios on the Gen1 dataset.
Figure 2: Illustration of two label-efficient event-based object detection settings: (a) weakly-supervised where all event sequences are sparsely annotated, and (b) semi-supervised where some event sequences are densely annotated, and others are fully unlabeled. We visualize both positive and negative events in black.
Figure 3: Overview of our LEOD pipeline. ⓪ We first pre-train an event-based object detector on event streams with limited labels. ① To leverage the temporal information, we apply time-flip Test-Time Augmentation (TTA) to unlabeled event streams and ensemble the model predictions. ② We then apply forward and backward tracking to identify temporally inconsistent bounding boxes, i.e., boxes associated with short tracks. ③ To handle noisy labels, a soft anchor assignment strategy is designed to ignore detection loss on unconfident pseudo labels (red boxes). ④ We can boost the model performance by self-training on reliable pseudo labels (blue boxes) and repeating steps ① to ④.
Figure 4: Illustration of the time-flip TTA, which makes the detector more robust to different object motions. The forward pass helps detect receding objects, while the backward pass helps with approaching objects.
Figure 5: Analysis of confidence thresholds. We randomly sample 10,000 predicted boxes from RVT-S pre-trained on 5% of Gen1 labels. We plot the precision and recall of the pseudo labels for (a) cars and (b) pedestrians. In (c) and (d), we show each box's predicted confidence score and its true IoU with the ground-truth boxes. $\tau_\text{hard}$ is the threshold for initial filtering, and $\tau_\text{soft}$ is used in soft anchor assignment.
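The two settings in Figure 2 differ only in how the available labels are distributed over event sequences. The sketch below builds both kinds of split from a hypothetical mapping of sequence name to labeled timestamps. The uniform stride in the weakly-supervised case and the random sequence selection in the semi-supervised case are illustrative assumptions. They are not necessarily the exact sampling protocol used on Gen1 and 1Mpx.

```python
import random
from typing import Dict, List

SeqLabels = Dict[str, List[int]]  # sequence name -> labeled timestamps (hypothetical format)


def weakly_supervised_split(seq_label_times: SeqLabels, ratio: float) -> SeqLabels:
    """Keep roughly `ratio` of the labeled timestamps in *every* sequence."""
    stride = max(1, round(1 / ratio))  # e.g. ratio=0.1 -> keep every 10th label
    return {seq: times[::stride] for seq, times in seq_label_times.items()}


def semi_supervised_split(seq_label_times: SeqLabels, ratio: float, seed: int = 0) -> SeqLabels:
    """Keep *all* labels of a random subset of sequences; the rest become unlabeled."""
    seqs = sorted(seq_label_times)
    rng = random.Random(seed)
    n_keep = min(len(seqs), max(1, round(ratio * len(seqs))))
    kept = set(rng.sample(seqs, n_keep))
    return {seq: (times if seq in kept else []) for seq, times in seq_label_times.items()}
```

In both cases the total number of labels is roughly the same for a given ratio. The difference is that the weakly-supervised split leaves sparse labels in every sequence, while the semi-supervised split leaves some sequences completely unannotated.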
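The time-flip TTA of Figure 4 (step ① in Figure 3) can be sketched with NumPy as follows. The sketch assumes events are given as parallel arrays `t, x, y, p` with binary polarity. Reversing time also inverts polarity, because a brightness increase played backwards is a brightness decrease. This summary does not say how LEOD ensembles the two passes, so merging them with plain NMS is an assumption (weighted box fusion would be an alternative). The backward predictions are also assumed to be already aligned to the same timestamps as the forward ones. All helper names are hypothetical.

```python
import numpy as np


def time_flip(t, x, y, p):
    """Reverse an event stream in time.

    Timestamps are mirrored inside the original window and the events re-sorted.
    Polarity is inverted (assumes p in {0, 1}).
    """
    t_flip = t.max() + t.min() - t
    order = np.argsort(t_flip, kind="stable")
    return t_flip[order], x[order], y[order], (1 - p)[order]


def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression on (N, 4) xyxy boxes; returns kept indices."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thr]  # drop boxes overlapping the kept one
    return keep


def ensemble_time_flip(fwd_boxes, fwd_scores, bwd_boxes, bwd_scores, iou_thr=0.5):
    """Merge forward and (time-aligned) backward predictions with NMS."""
    boxes = np.concatenate([fwd_boxes, bwd_boxes])
    scores = np.concatenate([fwd_scores, bwd_scores])
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```

A box found in only one pass survives the merge. This is how the backward pass can recover approaching objects that the forward pass misses.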
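Step ② in Figure 3 removes boxes that belong to short tracks. The sketch below is a simplified stand-in for that post-processing and not the tracker used in LEOD. It links boxes between consecutive frames greedily by IoU and runs once forward and once backward in time. A box is kept if its track reaches `min_len` frames in either direction. `min_len` and `iou_thr` are hypothetical defaults.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_lengths(frames: Sequence[List[Box]], iou_thr: float = 0.3) -> List[List[int]]:
    """Greedily link boxes across consecutive frames; return each box's track length."""
    ids_per_frame: List[List[int]] = []
    next_id = 0
    for t, boxes in enumerate(frames):
        ids = [-1] * len(boxes)
        if t > 0:
            prev, prev_ids = frames[t - 1], ids_per_frame[t - 1]
            # Candidate matches between previous and current boxes, best IoU first.
            pairs = sorted(
                ((iou(pb, b), i, j) for i, pb in enumerate(prev) for j, b in enumerate(boxes)),
                reverse=True,
            )
            used_prev = set()
            for score, i, j in pairs:
                if score < iou_thr:
                    break
                if i in used_prev or ids[j] != -1:
                    continue
                ids[j] = prev_ids[i]  # continue an existing track
                used_prev.add(i)
        for j in range(len(boxes)):
            if ids[j] == -1:  # unmatched box starts a new track
                ids[j], next_id = next_id, next_id + 1
        ids_per_frame.append(ids)
    counts: Dict[int, int] = {}
    for ids in ids_per_frame:
        for k in ids:
            counts[k] = counts.get(k, 0) + 1
    return [[counts[k] for k in ids] for ids in ids_per_frame]


def filter_short_tracks(frames: List[List[Box]], min_len: int = 3, iou_thr: float = 0.3):
    """Keep a box if its track is long enough in the forward or the backward pass."""
    fwd = track_lengths(frames, iou_thr)
    bwd = track_lengths(frames[::-1], iou_thr)[::-1]
    return [
        [b for b, lf, lb in zip(boxes, f_len, b_len) if max(lf, lb) >= min_len]
        for boxes, f_len, b_len in zip(frames, fwd, bwd)
    ]
```

With greedy matching, conflicts can be resolved differently in the two time directions, so taking the longer of the two tracks is a lenient choice. A box that persists over several frames is kept, while a one-frame flicker is dropped.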
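The two thresholds in Figure 5 split pseudo labels three ways, and step ③ in Figure 3 turns that split into a training signal. The following is a minimal NumPy sketch that assumes a simple IoU-based anchor matcher; RVT's actual detection head may assign anchors differently. Pseudo boxes scoring below `tau_hard` are discarded. Anchors matched to a remaining box with a score of at least `tau_soft` become positives. Anchors matched to a box scoring between the two thresholds are marked as ignored. All other anchors are negatives. `tau_hard`, `tau_soft`, and `pos_iou` are hypothetical parameters; no specific values from the paper are assumed.

```python
import numpy as np

POSITIVE, NEGATIVE, IGNORE = 1, 0, -1


def pairwise_iou(a, b):
    """IoU matrix between (N, 4) and (M, 4) xyxy boxes."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)


def soft_anchor_assignment(anchors, pl_boxes, pl_scores, tau_hard, tau_soft, pos_iou=0.5):
    """Return one label per anchor: POSITIVE, NEGATIVE, or IGNORE (no loss)."""
    # 1) Hard filtering: pseudo labels below tau_hard are dropped entirely.
    keep = pl_scores >= tau_hard
    pl_boxes, pl_scores = pl_boxes[keep], pl_scores[keep]
    labels = np.full(len(anchors), NEGATIVE)
    if len(pl_boxes) == 0:
        return labels
    # 2) Match each anchor to its best-overlapping remaining pseudo box.
    ious = pairwise_iou(anchors, pl_boxes)
    best = ious.argmax(axis=1)
    matched = ious[np.arange(len(anchors)), best] >= pos_iou
    confident = pl_scores[best] >= tau_soft
    # 3) Confident matches train as positives; unconfident ones are ignored.
    labels[matched & confident] = POSITIVE
    labels[matched & ~confident] = IGNORE
    return labels
```

Marking these anchors as ignored rather than negative avoids penalizing the detector for objects that the pseudo labels are merely unsure about, while still keeping them out of the positive set.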