Event Camera Data Dense Pre-training

Yan Yang; Liyuan Pan; Liu Liu

TL;DR

This work addresses pre-training neural networks for dense prediction tasks using event camera data. It shows that directly transferring RGB-based dense pre-training performs poorly because event images are spatially sparse. It introduces a self-supervised, EMA-based teacher-student framework that enforces patch-level, context-level, and image-level similarities, and mines contexts on the fly with K-means to form discriminative contexts for learning. A synthetic dataset, E-TartanAir, provides diverse scenes and motions for pre-training. Across semantic segmentation, optical flow estimation, and depth estimation, the method achieves state-of-the-art results, demonstrating the value of context-aware, event-only pre-training and its potential for future event-based foundation models.

Abstract

This paper introduces a self-supervised learning framework designed for pre-training neural networks tailored to dense prediction tasks using event camera data. Our approach utilizes solely event data for training. Transferring achievements from dense RGB pre-training directly to event camera data yields subpar performance. This is attributed to the spatial sparsity inherent in an event image (converted from event data), where many pixels do not contain information. To mitigate this sparsity issue, we encode an event image into event patch features, automatically mine contextual similarity relationships among patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features. For training our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns. Transfer learning performance on downstream dense prediction tasks illustrates the superiority of our method over state-of-the-art approaches.
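
To make the sparsity argument concrete, the following is a minimal sketch of converting raw events into an event image and measuring how many pixels stay empty. The event tuple layout `(x, y, timestamp, polarity)`, the two-channel count representation, and the helper names `events_to_image` and `empty_pixel_ratio` are assumptions chosen for illustration; the paper's actual conversion may differ.

```python
from typing import List, Tuple

# Assumed event layout: (x, y, timestamp, polarity), polarity in {+1, -1}.
Event = Tuple[int, int, float, int]


def events_to_image(events: List[Event], height: int, width: int) -> List[List[List[int]]]:
    """Accumulate events into a 2-channel count image.

    Channel 0 counts positive events, channel 1 counts negative events.
    This representation is an illustrative assumption.
    """
    image = [[[0, 0] for _ in range(width)] for _ in range(height)]
    for x, y, _timestamp, polarity in events:
        channel = 0 if polarity > 0 else 1
        image[y][x][channel] += 1
    return image


def empty_pixel_ratio(image: List[List[List[int]]]) -> float:
    """Fraction of pixels that received no event in either channel."""
    total = sum(len(row) for row in image)
    empty = sum(1 for row in image for px in row if px[0] == 0 and px[1] == 0)
    return empty / total


# Four events on a 4x4 sensor touch only three distinct pixels.
events = [(0, 0, 0.01, 1), (1, 0, 0.02, -1), (1, 0, 0.03, -1), (3, 2, 0.04, 1)]
image = events_to_image(events, height=4, width=4)
sparsity = empty_pixel_ratio(image)
```

In this toy example most pixels carry no information, which is the property that makes patch-wise objectives borrowed from dense RGB pre-training less effective on event images.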

Paper Structure (32 sections, 5 equations, 7 figures, 7 tables)

Introduction
Related Works
RGB image self-supervised learning.
Event image self-supervised learning.
Event datasets.
Method
Overall architecture.
Event image augmentations.
Patch-level similarity.
Context-level similarity.
Image-level similarity.
Pre-training objective.
Experiments
Implementation details.
Baselines.
...and 17 more sections

Figures (7)

Figure 1: Comparison of our scores with respect to the second-best and third-best scores for semantic segmentation [ddd17, evsegnet, dsec, ess], optical flow estimation [mvsec, dsec, eraft], and depth estimation [mvsec]. Superscripts beside evaluation metrics differentiate benchmark datasets for a specific task.
Figure 2: Overall architecture. During pre-training, our approach takes an event image $\mathbf{x}^{+}$ and its affine-transformed counterpart $\mathbf{x}^{\ast}$ as inputs, producing a pre-trained backbone network $\mathcal{F}_{s}$. A teacher network (marked by red boxes) and a student network are employed in the self-supervised training stage. Event images $\mathbf{x}^{+}$ and $\mathbf{x}^{\ast}$ are tiled into $N$ patches, denoted as $\mathbf{x}^{+}=\{\boldsymbol{x}^{+}_{i}\}$ and $\mathbf{x}^{\ast}=\{\boldsymbol{x}^{\ast}_{i}\}$, $i=1,...,N$. We randomly mask some patches of $\mathbf{x}^{\ast}$ given to the student, but leave $\mathbf{x}^{\ast}$ intact for the teacher. Patch-wise binary masks are represented by $\mathbf{m}=\{{m}_{i}\}$. Three similarity constraints are imposed on the output patch-wise features from the student and teacher backbones: i) Patch-level similarity. Patch-wise features of the masked $\mathbf{x}^{\ast}$ and the intact $\mathbf{x}^{\ast}$ are separately projected by heads $\mathcal{H}_{s}^{\mathsf{m}}$ in the student network and $\mathcal{H}_{t}^{\mathsf{m}}$ in the teacher network, obtaining embeddings $\{\boldsymbol{s}_{i}\}$ and $\{\boldsymbol{t}_{i}\}$. To reconstruct masked patch embeddings, we employ a cross-entropy loss $\mathcal{L}_{\text{patch}}$; ii) Context-level similarity. Features $\{\boldsymbol{z}_{i}^{+}\}$ from the teacher network are assigned to $K$ contexts, obtaining assignments $\{a_k(\boldsymbol{z}_{i}^{+})\}$, where $a_k(\boldsymbol{z}_{i}^{+})$ denotes the membership of the feature $\boldsymbol{z}_{i}^{+}$ in the $k$-th context. The assignments of student features $\{\boldsymbol{z}^{\ast}_{i}\}$ are computed by directly transferring $a_k(\boldsymbol{z}_{i}^{+})$ with an affine transformation. With the assignments $\{a_k(\boldsymbol{z}_{i}^{+})\}$ and $\{a_k(\boldsymbol{z}^{\ast}_{i})\}$, we collect and pool all features assigned to each context using heads $\mathcal{H}_{s}^{\mathsf{c}}$ and $\mathcal{H}_{t}^{\mathsf{c}}$, generating context embeddings $\{\boldsymbol{s}_{k}\}$ and $\{\boldsymbol{t}_{k}\}$. A cross-entropy loss $\mathcal{L}_{\text{context}}$ is used to learn masked context embeddings. The forward passes from $\mathbf{x}^{+}$ are colored in blue, and the blocked lines denote crosslines; iii) Image-level similarity. $\{\boldsymbol{z}^{\ast}_{i}\}$ and $\{\boldsymbol{z}_{i}^{+}\}$ are first pooled separately and then projected by the heads $\mathcal{H}_{s}^{\mathsf{img}}$ and $\mathcal{H}_{t}^{\mathsf{img}}$ into global image embeddings $\boldsymbol{s}^{\mathsf{img}}$ and $\boldsymbol{t}^{\mathsf{img}}$. A cross-entropy loss $\mathcal{L}_{\text{image}}$ is used to encourage image-level similarity.
Figure 3: Context assignment and aggregation. Given patch features $\{\boldsymbol{z}^{\ast}_{i}\}$ and $\{\boldsymbol{z}_{i}^{+}\}$, we perform K-means clustering to mine $K$ contexts, and obtain the patch-to-context assignments $\{a_k(\boldsymbol{z}^{\ast}_{i})\}$ and $\{a_k(\boldsymbol{z}_{i}^{+})\}$, respectively. For the $k$-th context, the features $\{\boldsymbol{z}^{+}_{i}\}$ assigned to it (those with $a_k(\boldsymbol{z}^{+}_{i})=1$, $i=1,...,N$) are pooled into a context embedding $\boldsymbol{t}_k$. Similarly, $\{\boldsymbol{z}_{i}^{\ast}\}$ are pooled into context embeddings $\{\boldsymbol{s}_k\}$. The red box and blue lines denote components of our teacher network and forward passes of $\{\boldsymbol{z}_{i}^{+}\}$, respectively.
Figure 4: Qualitative comparison examples of dense predictions, namely, semantic segmentation (1$^\text{st}$-2$^\text{nd}$ rows), optical flow estimation (3$^\text{rd}$-4$^\text{th}$ rows), and depth estimation (5$^\text{th}$-6$^\text{th}$ rows). (a) and (d): event images. Red and blue pixels depict positive and negative events, respectively. (b) and (e): ground-truth labels. (c) and (f): our model predictions. The brightness of depth maps in the 5$^\text{th}$ row of (b) and (c) is enhanced for visualization.
Figure 5: Comparison of the number of pre-training epochs.
...and 2 more figures
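
The context assignment and aggregation described in the captions of Figures 2 and 3 can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the helper names (`kmeans_assign`, `pool_contexts`) are hypothetical, pooling is assumed to be a plain mean, the projection heads are omitted, and the affine transfer of assignments is replaced by the identity (patch $i$ of the student is assumed to correspond to patch $i$ of the teacher).

```python
import random
from typing import List, Optional

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_assign(features: List[Vector], k: int, iters: int = 10, seed: int = 0) -> List[int]:
    """Plain K-means; returns the context index assigned to each patch feature."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    assignments = [0] * len(features)
    for _ in range(iters):
        # Assign every feature to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [f for f, a in zip(features, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignments


def pool_contexts(features: List[Vector], assignments: List[int], k: int) -> List[Optional[Vector]]:
    """Mean-pool the features of each context; None marks an empty context."""
    pooled: List[Optional[Vector]] = []
    for c in range(k):
        members = [f for f, a in zip(features, assignments) if a == c]
        pooled.append(mean_vector(members) if members else None)
    return pooled


# Toy teacher features z+ and student features z* for N = 4 patches.
teacher_features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
student_features = [[0.0, 0.2], [0.2, 0.0], [5.0, 5.2], [5.2, 5.0]]

K = 2
teacher_assignments = kmeans_assign(teacher_features, K)
# Assumption: identity transfer stands in for the affine transfer of assignments.
student_assignments = list(teacher_assignments)

teacher_contexts = pool_contexts(teacher_features, teacher_assignments, K)  # {t_k}
student_contexts = pool_contexts(student_features, student_assignments, K)  # {s_k}
```

In the full method, the pooled pairs $\{\boldsymbol{s}_k\}$ and $\{\boldsymbol{t}_k\}$ would then be compared with the cross-entropy loss $\mathcal{L}_{\text{context}}$; that step is left out here because the captions do not specify its exact form.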
