Expanding Event Modality Applications through a Robust CLIP-Based Encoder

Sungheon Jeong, Hanning Chen, Sanggeon Yun, Suhyeon Cho, Wenjun Huang, Xiangjian Liu, Mohsen Imani

TL;DR

This work addresses the scarcity of large event datasets by transferring CLIP's image-text alignment to the event modality through a robust event encoder $f_E$ initialized from the image encoder $f_I$ and trained alongside a fixed $f_I$ and $f_T$. It represents events as a single-frame grayscale input $E$ derived via $E(x,y) = \sum_p\sum_t E(x,y,t,p)$ with $E = \frac{E(x,y)}{\max(E(x,y)) + 1}$, and optimizes with a composite loss $L = L_{ct} + \alpha L_{zs} + L_{kl}$ (plus $L_{pred}$ for fine-tuning) to preserve CLIP's capabilities while learning event-specific features. The approach achieves state-of-the-art object recognition, strong zero-shot and few-shot performance, generalizes to video-extracted events, and enables cross-modal retrieval and multi-modal interaction across five modalities (Image, Event, Text, Sound, Depth). These results demonstrate the viability of cross-modal learning with limited event data and broaden the practical impact of event modality in anomaly detection, retrieval, and integrated sensing tasks.
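The single-frame representation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that $E(x,y,t,p)$ is an indicator equal to 1 wherever an event occurred, so the double sum reduces to a per-pixel event count across all timestamps and both polarities. The function name `events_to_frame` and the `(x, y, t, p)` tuple layout are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical event layout: (x, y, timestamp, polarity).
Event = Tuple[int, int, float, int]


def events_to_frame(events: Iterable[Event], height: int, width: int) -> List[List[float]]:
    """Collapse an event stream into one normalized grayscale frame."""
    counts = [[0] * width for _ in range(height)]
    # Sum over time t and polarity p: each event adds 1 at its pixel
    # (assumption: E(x, y, t, p) is a 0/1 indicator).
    for x, y, _t, _p in events:
        counts[y][x] += 1
    peak = max(max(row) for row in counts)
    # Divide by max + 1, as in the formula above, so values lie in [0, 1).
    return [[c / (peak + 1) for c in row] for row in counts]


# Usage: three events on a 2x2 sensor, two of them at pixel (x=0, y=0).
events = [(0, 0, 0.1, 1), (0, 0, 0.2, 0), (1, 0, 0.3, 1)]
frame = events_to_frame(events, height=2, width=2)
```

The `+ 1` in the denominator keeps the division well defined even for an empty event stream, where every count is zero.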

Abstract

This paper introduces a powerful encoder that transfers CLIP's capabilities to event-based data, enhancing its utility and expanding its applicability across diverse domains. While large-scale datasets have significantly advanced image-based models, the scarcity of comprehensive event datasets has limited performance potential in the event modality. To address this challenge, we adapt CLIP's architecture to align event embeddings with image embeddings, supporting zero-shot learning and preserving text alignment while mitigating catastrophic forgetting. Our encoder achieves strong performance in object recognition, with competitive results in zero-shot and few-shot learning tasks. Notably, it generalizes effectively to events extracted from video data without requiring additional training, highlighting its versatility. Additionally, we integrate this encoder within a cross-modality framework that facilitates interaction across five modalities (Image, Event, Text, Sound, and Depth), expanding the possibilities for cross-modal applications. Overall, this work underscores the transformative potential of a robust event encoder, broadening the scope and utility of event-based data across various fields.

Paper Structure

This paper contains 14 sections, 1 theorem, 7 equations, 5 figures, 4 tables.

Key Result

Proposition 1

Let $f_E$ be the query encoder, $f_I$ the fixed key encoder, and $f_T$ the text embedding module. The ZSCL (Zero-Shot Contrastive Learning) objective [zheng2023preventing] aligns the query embeddings $f_E(E)$ with the text embeddings $f_T(T)$, and the key embeddings $f_I(I)$ with $f_T(T)$, through th[…] The parameter update for $\theta_E$ follows a momentum-like behavior, aligning $f_E$ with the fixed […]

Figures (5)

  • Figure 1: Overview of the proposed approach for aligning event and image representations within the CLIP framework. The image and event data are processed through separate encoders, with the image encoder $f_I$ and text encoder $f_T$ frozen, while the event encoder $f_E$ is trainable. Various loss functions, including $L_{\text{ct}}$, $L_{\text{zs}}$, and $L_{\text{kl}}$, ensure robust alignment across modalities and prevent collapse, facilitating the learning of shared features between events and images. $L_{\text{pred}}$ is used only during the fine-tuning stage, where it provides direct supervision by aligning the prediction of $f_E$, $y^\prime$, with one-hot labels $y$.
  • Figure 2: Extracting events from video frames. Differences between consecutive frames are activated based on a threshold, and the resulting events are sequentially stacked to generate $E$ from ($f_0 \sim f_n$), where $N$ denotes the total number of frames in the stack.
  • Figure 3: Event retrieval process across different modalities (Image, Text, Sound, Depth) using the event as query. The query event calculates the maximum similarity with each modality’s key embedding, returning the key modality with the highest similarity score.
  • Figure 4: Zero-shot accuracy on unseen classes during N-ImageNet pre-training, measured for different loss configurations. The figure illustrates how the inclusion or exclusion of each loss component affects performance on unseen classes.
  • Figure 5: Gradient-based attention maps and text relevance scores from our pre-trained model on (a) N-ImageNet unseen classes, (b) N-Caltech, and (c) UCFCrime, following the method in [chefer2021generic].
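
The procedures described in the captions of Figures 2 and 3 can be illustrated with a short Python sketch. It is a rough approximation under stated assumptions, not the paper's code: the absolute intensity difference between consecutive frames stands in for the activation rule, the stacked counts are normalized by the maximum plus one as in the TL;DR formula, and cosine similarity stands in for the unspecified similarity score. The names `events_from_frames`, `cosine`, and `retrieve` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def events_from_frames(frames: Sequence[Frame], threshold: float) -> Frame:
    """Figure 2 sketch: threshold frame differences, then stack them into one frame E."""
    height, width = len(frames[0]), len(frames[0][0])
    counts = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                # Assumption: a pixel fires when its absolute change exceeds the threshold.
                if abs(curr[y][x] - prev[y][x]) > threshold:
                    counts[y][x] += 1
    peak = max(max(row) for row in counts)
    return [[c / (peak + 1) for c in row] for row in counts]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float], keys: Dict[str, Sequence[float]]) -> str:
    """Figure 3 sketch: return the key whose embedding is most similar to the query."""
    return max(keys, key=lambda name: cosine(query, keys[name]))


# Usage: three 1x2 frames; only the left pixel changes by more than the threshold.
video = [[[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 0.05]]]
event_frame = events_from_frames(video, threshold=0.1)

# Toy 2-D embeddings standing in for encoder outputs.
best = retrieve([1.0, 0.0], {"image": [0.9, 0.1], "sound": [0.0, 1.0]})
```

In practice the query and key vectors would come from the event encoder $f_E$ and the encoders of the other modalities; the toy vectors here only show the ranking step.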

Theorems & Definitions (1)

  • Proposition 1