EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, Lin Wang
TL;DR
EventBind tackles open-world event recognition by learning a unified embedding space for images, texts, and asynchronous event data. It introduces an event encoder that models temporal dynamics and generates event prompts, a text encoder that uses hybrid prompts and content prompts, and a Hierarchical Triple Contrastive Alignment (HTCA) module that jointly aligns all three modalities. The total objective $L_{final} = \alpha L(f^{i}, f^{e}) + \beta L(f^{e}, f^{t,e}) + \theta L(f^{t,i}, f^{t,e}) + \gamma \mathrm{MSE}(f^{t,m}_{l}, f^{t,m}_{h})$ combines three weighted contrastive terms with an MSE consistency term, so that all modalities are trained in one shared space. Across three benchmarks (N-Caltech101, N-Imagenet, N-MNIST), EventBind achieves state-of-the-art results in fine-tuning and few-shot settings and demonstrates effective event retrieval with text or image queries, underscoring its practical potential for scalable, open-world event understanding.
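To make the structure of the total objective concrete, here is a minimal pure-Python sketch of how its four weighted terms could be combined. It is an illustration under assumptions, not the authors' implementation: the summary does not specify the form of $L(\cdot,\cdot)$, so a symmetric InfoNCE-style contrastive loss over cosine similarities is assumed, and the function names (`contrastive`, `final_loss`), the temperature of 0.07, and the default weights of 1.0 are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def contrastive(a, b, temperature=0.07):
    """ASSUMED form of L(a, b): symmetric InfoNCE over a batch of pairs."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b]
              for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def mse(a, b):
    """Mean squared error over all elements of two batches of vectors."""
    diffs = [(x - y) ** 2 for u, w in zip(a, b) for x, y in zip(u, w)]
    return sum(diffs) / len(diffs)


def final_loss(f_i, f_e, f_t_e, f_t_i, f_t_l, f_t_h,
               alpha=1.0, beta=1.0, theta=1.0, gamma=1.0):
    """Weighted sum mirroring the four terms of L_final (weights are placeholders)."""
    return (alpha * contrastive(f_i, f_e)        # image vs. event
            + beta * contrastive(f_e, f_t_e)     # event vs. event-side text
            + theta * contrastive(f_t_i, f_t_e)  # image-side text vs. event-side text
            + gamma * mse(f_t_l, f_t_h))         # consistency between the two text features


# Toy batch of two perfectly aligned pairs: every term is close to zero.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_value = final_loss(aligned, aligned, aligned, aligned, aligned, aligned)
```

With perfectly aligned toy embeddings the loss is near zero; swapping the rows of one input makes the corresponding contrastive term large, which is the behavior the alignment terms are meant to penalize.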
Abstract
In this paper, we propose EventBind, a novel and effective framework that unleashes the potential of vision-language models (VLMs) for event-based recognition to compensate for the lack of large-scale event-based datasets. Because of the distinct modality gap with image-text data and the lack of large-scale datasets, learning a common representation space for images, texts, and events is non-trivial. Intuitively, we need to address two key challenges: 1) how to generalize CLIP's visual encoder to event data while fully leveraging events' unique properties, e.g., sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, i.e., image, text, and events. Accordingly, we first introduce a novel event encoder that models the temporal information in events while generating event prompts for modality bridging. We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability across diverse datasets. With the proposed event encoder, text encoder, and image encoder, a novel Hierarchical Triple Contrastive Alignment (HTCA) module is introduced to jointly optimize the correlation and enable efficient knowledge transfer among the three modalities. We evaluate various settings, including fine-tuning and few-shot, on three benchmarks, and EventBind achieves new state-of-the-art accuracy compared with previous methods, such as on N-Caltech101 (+5.34% and +1.70%) and N-Imagenet (+5.65% and +1.99%) with fine-tuning and 20-shot settings, respectively. Moreover, EventBind can be flexibly extended to the event retrieval task using text or image queries, showing plausible performance. Project page: https://vlislab22.github.io/EventBind/.
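The retrieval extension mentioned in the abstract can be pictured as nearest-neighbor search in the shared embedding space. The sketch below ranks a gallery of event embeddings by cosine similarity to a text or image query embedding. It is a generic illustration, assuming embeddings are already computed by the respective encoders; the names `cosine` and `retrieve` and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def retrieve(query, gallery, top_k=1):
    """Return indices of the top_k gallery embeddings most similar to the query."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: cosine(query, gallery[i]),
                    reverse=True)
    return ranked[:top_k]


# Toy example: three event embeddings and one query (text or image) embedding.
event_gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_embedding = [0.9, 0.1]
top_two = retrieve(query_embedding, event_gallery, top_k=2)
```

Because all three modalities share one space, the same ranking routine serves both text-to-event and image-to-event queries; only the encoder producing `query_embedding` changes.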
