Expanding Event Modality Applications through a Robust CLIP-Based Encoder
Sungheon Jeong, Hanning Chen, Sanggeon Yun, Suhyeon Cho, Wenjun Huang, Xiangjian Liu, Mohsen Imani
TL;DR
This work addresses the scarcity of large event datasets by transferring CLIP's image-text alignment to the event modality. A robust event encoder $f_E$ is initialized from the image encoder $f_I$ and trained while $f_I$ and the text encoder $f_T$ are kept fixed. Events are represented as a single-frame grayscale input $E$, obtained by accumulating over polarity and time, $E(x,y) = \sum_p\sum_t E(x,y,t,p)$, and then normalizing, $E = \frac{E(x,y)}{\max(E(x,y)) + 1}$. Training optimizes a composite loss $L = L_{ct} + \alpha L_{zs} + L_{kl}$ (plus $L_{pred}$ for fine-tuning) to preserve CLIP's capabilities while learning event-specific features. The approach achieves state-of-the-art object recognition and strong zero-shot and few-shot performance, generalizes to video-extracted events, and enables cross-modal retrieval and multi-modal interaction across five modalities (Image, Event, Text, Sound, Depth). These results demonstrate the viability of cross-modal learning with limited event data and broaden the practical impact of the event modality in anomaly detection, retrieval, and integrated sensing tasks.
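The single-frame event representation above can be illustrated with a minimal sketch. It assumes that each event is a tuple `(x, y, t, p)` and that every event contributes a count of 1 regardless of polarity. The summary does not spell out either detail, and the function name `events_to_frame` is hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed event layout: (x, y, timestamp, polarity).
Event = Tuple[int, int, float, int]


def events_to_frame(events: Iterable[Event], height: int, width: int) -> List[List[float]]:
    """Accumulate events into one grayscale frame with values in [0, 1)."""
    # E(x, y) = sum over polarity and time: count the events at each pixel.
    counts = [[0] * width for _ in range(height)]
    for x, y, _t, _p in events:
        if 0 <= x < width and 0 <= y < height:  # ignore out-of-range events
            counts[y][x] += 1

    # E = E(x, y) / (max(E(x, y)) + 1); the +1 also avoids division by zero
    # when the event stream is empty.
    peak = max(max(row) for row in counts)
    return [[c / (peak + 1) for c in row] for row in counts]


# Three events: two at pixel (0, 0) with opposite polarity, one at (1, 0).
frame = events_to_frame([(0, 0, 0.0, 1), (0, 0, 0.1, 0), (1, 0, 0.2, 1)], height=2, width=2)
```

Because the denominator is the peak count plus one, the busiest pixel stays strictly below 1, and a frame with no events is all zeros.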
Abstract
This paper introduces a robust encoder that transfers CLIP's capabilities to event-based data, enhancing its utility and expanding its applicability across diverse domains. While large-scale datasets have significantly advanced image-based models, the scarcity of comprehensive event datasets has limited performance in the event modality. To address this challenge, we adapt CLIP's architecture to align event embeddings with image embeddings, supporting zero-shot learning and preserving text alignment while mitigating catastrophic forgetting. Our encoder achieves strong performance in object recognition, with competitive results in zero-shot and few-shot learning tasks. Notably, it generalizes effectively to events extracted from video data without requiring additional training, highlighting its versatility. Additionally, we integrate this encoder within a cross-modality framework that facilitates interaction across five modalities (Image, Event, Text, Sound, and Depth), expanding the possibilities for cross-modal applications. Overall, this work underscores the potential of a robust event encoder to broaden the scope and utility of event-based data across various fields.
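Because event embeddings are aligned with CLIP's shared embedding space, zero-shot recognition can in principle follow the usual CLIP recipe: compare an event embedding against text embeddings of candidate labels and pick the most similar one. The sketch below shows only that comparison step. The embeddings are hand-written stand-ins rather than outputs of the actual encoders, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_label(
    event_emb: Sequence[float],
    text_embs: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Return the label whose text embedding is closest to the event embedding."""
    scores = [cosine(event_emb, t) for t in text_embs]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], scores


# Stand-in 2-D embeddings; real ones would come from the event and text encoders.
label, scores = zero_shot_label([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0]], ["car", "dog"])
```

The same similarity ranking, applied between event embeddings and embeddings from another modality, is one plausible basis for cross-modal retrieval.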
