
EZSR: Event-based Zero-Shot Recognition

Yan Yang, Liyuan Pan, Dongxu Li, Liu Liu

TL;DR

This study develops an event encoder that does not rely on additional reconstruction networks and achieves superior zero-shot object recognition performance on a broad set of standard benchmark datasets, even compared with past supervised learning approaches.

Abstract

This paper studies zero-shot object recognition using event camera data. Guided by CLIP, which is pre-trained on RGB images, existing approaches achieve zero-shot object recognition by optimizing the similarity between embeddings of event data and RGB images, encoded by an event encoder and the CLIP image encoder, respectively. Alternatively, several methods learn to reconstruct RGB frames from event data and pass them to the CLIP image encoder. However, both often yield suboptimal zero-shot performance. This study develops an event encoder that does not rely on additional reconstruction networks. We theoretically analyze the performance bottleneck of previous approaches: the embedding optimization objectives are prone to suffer from the spatial sparsity of event data, causing semantic misalignment between the learned event embedding space and the CLIP text embedding space. To mitigate the issue, we explore a scalar-wise modulation strategy. Furthermore, to scale up the number of event and RGB data pairs for training, we also study a pipeline for synthesizing event data from static RGB images at scale. Experimentally, we demonstrate an attractive scaling property with respect to the number of parameters and the amount of synthesized data. We achieve superior zero-shot object recognition performance on a broad set of standard benchmark datasets, even compared with past supervised learning approaches. For example, our model with a ViT-B/16 backbone achieves 47.84% zero-shot accuracy on the N-ImageNet dataset.
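To make the zero-shot setup concrete, the following is a minimal sketch of CLIP-style recognition: an embedding is compared against one text embedding per class by cosine similarity, and the most similar class is predicted. The function names and the toy 3-D vectors are hypothetical stand-ins for real encoder outputs; this is not the paper's implementation and omits the modulation strategy entirely.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_predict(event_emb, text_embs):
    """Return (predicted class index, per-class cosine similarities).

    event_emb: shape (D,), output of a (hypothetical) event encoder.
    text_embs: shape (C, D), one text embedding per class prompt.
    """
    sims = l2_normalize(text_embs) @ l2_normalize(event_emb)
    return int(np.argmax(sims)), sims

# Toy values only; real embeddings would come from trained encoders.
text_embs = np.array([[1.0, 0.0, 0.0],   # prompt for class 0
                      [0.0, 1.0, 0.0],   # prompt for class 1
                      [0.0, 0.0, 1.0]])  # prompt for class 2
event_emb = np.array([0.2, 0.9, 0.1])

pred, sims = zero_shot_predict(event_emb, text_embs)
print(pred, np.round(sims, 3))
```

Because the text embeddings are fixed, recognition quality in this setup depends on how well the event embedding space lines up with the text embedding space, which is the misalignment issue the abstract describes.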

Paper Structure (33 sections, 1 theorem, 5 equations, 7 figures, 4 tables)


Key Result

Lemma 1

When \ref{eq:baseline} is effectively minimized, $\boldsymbol{x}^{\mathsf{img}}_{+} \cdot \boldsymbol{x}^{\mathsf{txt}}_{+} > \boldsymbol{x}^{\mathsf{img}}_{-} \cdot \boldsymbol{x}^{\mathsf{txt}}_{+}$ does not imply $\hat{\boldsymbol{x}}^{\mathsf{evt}} \cdot \boldsymbol{x}^{\mathsf{txt}}_{+} > \cdots$

Figures (7)

  • Figure 1: Comparison of our accuracies (%) with the second-best and third-best accuracies (%) from previous methods [eventbind, eventclip, eclip] on object [nimagnet, ncaltech, CIFAR-10-DVS, nmnist] and action [hardvs, dailyaction, PAF, bully10k, hmdbdvs] recognition. The dataset name is given beside each axis.
  • Figure 2: Similarity distribution for (a) RGB and (b) event embeddings. RGB and event embeddings are extracted using a pre-trained CLIP model [eva] on the validation folds of the ImageNet-1K [imagenet] and N-ImageNet [nimagnet] datasets. Cosine similarities are computed separately among the RGB and event embeddings, and the density is normalized by the maximum value.
  • Figure 3: Overview of our method. Our goal is to learn an event encoder $f^{\mathsf{evt}}(\cdot)$ that replaces the image encoder $f^{\mathsf{img}}(\cdot)$ of a pre-trained CLIP, enabling zero-shot object recognition with event data. Given paired event data and RGB images, we extract the embeddings $\{\hat{\boldsymbol{x}}^{\mathsf{evt}}\}$ and $\{\boldsymbol{x}^{\mathsf{img}}\}$ from $f^{\mathsf{evt}}(\cdot)$ and $f^{\mathsf{img}}(\cdot)$, respectively, to optimize \ref{eq:baseline} and \ref{eq:reg}. The fire and snowflake emojis denote trainable and frozen components, respectively.
  • Figure 4: Samples of synthetic event data. (a)/(c) are RGB images. (b)/(d) are event frames, where red and blue indicate positive and negative events, respectively.
  • Figure 5: Ablation study on the number of training epochs and the percentage of training data. (a) The number of training epochs is varied from 10 to 60 with a step size of 10. (b) The percentage of synthetic data used to train our method is varied from 20% to 100% with a step size of 20%.
  • ...and 2 more figures
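
As a rough illustration of the kind of event frame shown in Figure 4, the sketch below renders signed events from a single static RGB image. It assumes a simple recipe that is not taken from the paper: camera motion is simulated by shifting the image by a few pixels, and an event fires wherever the log-intensity change exceeds a contrast threshold. The function name, the `shift` and `threshold` parameters, and the toy image are all hypothetical.

```python
import numpy as np

def synthesize_event_frame(rgb, shift=(0, 1), threshold=0.1, eps=1e-3):
    """Render a signed event frame from one static RGB image.

    rgb: float array of shape (H, W, 3) with values in [0, 1].
    shift: (dy, dx) pixel displacement simulating camera motion (assumed).
    threshold: contrast threshold on the log-intensity change (assumed).
    Returns (polarity, frame): polarity is in {-1, 0, +1} per pixel, and
    frame is a uint8 RGB image with red = positive, blue = negative.
    """
    gray = rgb.mean(axis=2)
    log_before = np.log(gray + eps)
    # np.roll wraps around the border, so edge pixels are only approximate.
    log_after = np.roll(log_before, shift=shift, axis=(0, 1))
    diff = log_after - log_before

    polarity = np.zeros(diff.shape, dtype=np.int8)
    polarity[diff >= threshold] = 1
    polarity[diff <= -threshold] = -1

    frame = np.zeros(rgb.shape, dtype=np.uint8)
    frame[..., 0] = 255 * (polarity == 1)    # red channel: positive events
    frame[..., 2] = 255 * (polarity == -1)   # blue channel: negative events
    return polarity, frame

# Toy image: dark left half, bright right half (a single vertical edge).
image = np.full((4, 6, 3), 0.1)
image[:, 3:, :] = 0.9

polarity, frame = synthesize_event_frame(image)
print(polarity)
```

Events appear only where the intensity changes under the simulated motion, so flat regions stay empty. This is the spatial sparsity that the abstract identifies as a difficulty for embedding alignment.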

Theorems & Definitions (2)

  • Lemma 1
  • Remark 1