EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding

Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, Lin Wang

TL;DR

EventBind tackles open-world event recognition by learning a unified embedding space for images, texts, and asynchronous event data. It introduces an event encoder that models temporal dynamics and generates event prompts, a text encoder with hybrid prompts and content prompts, and a Hierarchical Triple Contrastive Alignment (HTCA) module that jointly aligns all three modalities. The total objective $L_{final} = \alpha L(f^{i}, f^{e}) + \beta L(f^{e}, f^{t,e}) + \theta L(f^{t,i}, f^{t,e}) + \gamma\,\mathrm{MSE}(f^{t,m}_{l}, f^{t,m}_{h})$ combines three contrastive losses with one consistency loss so that training happens in a single shared space. Across three benchmarks (N-Caltech101, N-Imagenet, N-MNIST), EventBind achieves state-of-the-art results in fine-tuning and few-shot settings and demonstrates effective event retrieval with text or image queries, underscoring its practical potential for scalable, open-world event understanding.
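
To make the structure of the total objective concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that $L(\cdot,\cdot)$ is a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings and that the temperature is 0.07; neither detail is stated in this summary. The function names, argument names, and the default weights of 1.0 are hypothetical placeholders. The two inputs of the MSE term, $f^{t,m}_{l}$ and $f^{t,m}_{h}$, are simply treated as two text-embedding arrays of equal shape.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(a, b, temperature=0.07):
    # ASSUMPTION: symmetric InfoNCE; row k of `a` is the positive for row k of `b`.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def final_loss(f_i, f_e, f_te, f_ti, f_tm_l, f_tm_h,
               alpha=1.0, beta=1.0, theta=1.0, gamma=1.0):
    # Weighted sum mirroring the four terms of L_final, in the same order.
    # The default weights are placeholders, not values from the paper.
    mse = np.mean((f_tm_l - f_tm_h) ** 2)
    return (alpha * contrastive_loss(f_i, f_e)      # image vs. event
            + beta * contrastive_loss(f_e, f_te)    # event vs. event-side text
            + theta * contrastive_loss(f_ti, f_te)  # image-side text vs. event-side text
            + gamma * mse)                          # text consistency term
```

Calling `final_loss` on six arrays of shape `(batch, dim)` returns a single scalar. Setting a weight to zero removes the corresponding term, which is a convenient way to ablate one alignment at a time.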

Abstract

In this paper, we propose EventBind, a novel and effective framework that unleashes the potential of vision-language models (VLMs) for event-based recognition to compensate for the lack of large-scale event-based datasets. In particular, due to the distinct modality gap with the image-text data and the lack of large-scale datasets, learning a common representation space for images, texts, and events is non-trivial. Intuitively, we need to address two key challenges: 1) how to generalize CLIP's visual encoder to event data while fully leveraging events' unique properties, e.g., sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, i.e., image, text, and events. Accordingly, we first introduce a novel event encoder that subtly models the temporal information from events while also generating event prompts for modality bridging. We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability across diverse datasets. With the proposed event encoder, text encoder, and image encoder, a novel Hierarchical Triple Contrastive Alignment (HTCA) module is introduced to jointly optimize the correlation and enable efficient knowledge transfer among the three modalities. We evaluate various settings, including fine-tuning and few-shot, on three benchmarks, and EventBind achieves new state-of-the-art accuracy compared with previous methods, for example on N-Caltech101 (+5.34% and +1.70%) and N-Imagenet (+5.65% and +1.99%) in the fine-tuning and 20-shot settings, respectively. Moreover, EventBind can be flexibly extended to the event retrieval task using text or image queries, showing plausible performance. Project page: https://vlislab22.github.io/EventBind/.
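
Once images, texts, and events share one embedding space, retrieval with a text or image query reduces to a nearest-neighbour search. The sketch below illustrates that idea with cosine similarity in NumPy. It is an assumption about how such retrieval could be done, not the procedure described in the paper. The function `retrieve_events` and its parameters are hypothetical, and the embeddings are assumed to have been produced beforehand by the respective encoders.

```python
import numpy as np

def retrieve_events(query_emb, event_embs, top_k=5):
    # query_emb:  (dim,) embedding of a text or image query.
    # event_embs: (num_events, dim) embeddings of the candidate event samples.
    q = query_emb / np.linalg.norm(query_emb)
    e = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    scores = e @ q                        # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]   # indices sorted by descending score
    return order, scores[order]
```

The function returns the indices of the `top_k` most similar event samples together with their similarity scores, so the same call serves both text-to-event and image-to-event retrieval.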

Paper Structure

This paper contains 13 sections, 8 equations, 7 figures, and 16 tables.

Figures (7)

  • Figure 1: Overview of EventBind, which extracts events' high-temporal-resolution and spatially sparse information via the proposed Event Encoder and aligns event, image, and text embeddings in a unified representation space with a novel Hierarchical Triple Contrastive Alignment (HTCA) module. EventBind solves practical tasks such as open-world object recognition and few-shot object recognition with significant performance improvements over the previous best models [liu2022fast, klenk2024masked]. The EventBind framework can also be flexibly extended to image-to-event and text-to-event retrieval tasks.
  • Figure 2: Overall framework of EventBind, which consists of an image encoder pre-aligned with a text encoder and the proposed event encoder. It takes an image (optional), text, and events as input and generates the image embeddings $f^{i}$, the event embeddings $f^{e}$, and the text embeddings $f^{t,e}, f^{t,i}$. All output embeddings are then aligned in the HTCA module to establish a unified representation space.
  • Figure 3: The architecture of our event encoder, which consists of two key technical parts: (a) temporal modeling, which combines temporal encoding and cross-frame prompts to model event spatial-temporal information by exchanging information between event frames; (b) event prompts, which generate event modality prompts that provide additional parameters for modality bridging.
  • Figure 4: Ablation of hyperparameters: the aggregated event count per frame $N$, evaluated by zero-shot performance with 3 different ViT backbones across three datasets.
  • Figure 5: Visual examples for the event retrieval task.
  • ...and 2 more figures