CEIA: CLIP-Based Event-Image Alignment for Open-World Event-Based Understanding

Wenhao Xu, Wenming Weng, Yueyi Zhang, Zhiwei Xiong

TL;DR

CEIA tackles the shortage of large-scale paired event-text data by learning an event encoder aligned with CLIP’s image space through cross-modal contrastive learning on abundant event-image pairs. By initializing the event encoder from CLIP’s image encoder and applying LoRA-based finetuning, CEIA preserves strong zero-shot capabilities while achieving robust alignment across event, image, and text modalities via the bridging role of images. The framework delivers state-of-the-art results on zero-shot object recognition, event-image and event-text retrieval, and domain adaptation, demonstrating strong transferability and scalability with more training data. This open-world, multimodal alignment approach broadens event-based understanding and points to scalable avenues for future event-based large vision models.

Abstract

We present CEIA, an effective framework for open-world event-based understanding. Training a large event-text model remains a major challenge because paired event-text data are scarce. In response, CEIA learns to align event and image data instead of directly aligning event and text data. Specifically, we leverage rich event-image datasets to learn an event embedding space aligned with the image space of CLIP through contrastive learning. In this way, event and text data are naturally aligned, with image data serving as a bridge. CEIA offers two distinct advantages. First, it allows us to take full advantage of existing event-image datasets to make up for the shortage of large-scale event-text datasets. Second, it can flexibly improve performance as more training data become available, which makes it scalable. To highlight the versatility of our framework, we conduct extensive evaluations across a diverse range of event-based multi-modal applications, including object recognition, event-image retrieval, event-text retrieval, and domain adaptation. The results demonstrate CEIA's distinct zero-shot superiority over existing methods on these applications.
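
The event-image alignment described in the abstract can be illustrated with a symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings. The following is a minimal sketch, not the authors' implementation: the function names, the pure-Python list representation, and the temperature of 0.07 (the value CLIP uses to initialize its logit scale) are assumptions made for illustration. A practical version would use a tensor library and operate on the outputs of the learnable event encoder and the frozen CLIP image encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def event_image_contrastive_loss(event_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    event_emb[i] and image_emb[i] are assumed to come from the same scene;
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    ev = [l2_normalize(e) for e in event_emb]
    im = [l2_normalize(i) for i in image_emb]
    # logits[i][j] = scaled cosine similarity between event i and image j
    logits = [[sum(a * b for a, b in zip(e, i)) / temperature for i in im] for e in ev]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-event direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

Under this sketch, a batch whose event and image embeddings already agree gives a loss near zero, while a batch with the pairs swapped gives a large loss. That gap is the signal an event encoder would be trained on while the image embeddings stay fixed.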

Paper Structure (18 sections, 10 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: (a) Compared with EventCLIP [wu2023eventclip], which directly utilizes the frozen CLIP image encoder, our CEIA learns an event encoder to alleviate the event-image modality disparity. (b) Comparison of our CEIA and EventCLIP [wu2023eventclip] on various datasets and tasks. For zero-shot recognition and domain adaptation, we report Acc1 (%), while for event-image retrieval and event-text retrieval, we report R@1 (%) [lee2018stacked].
  • Figure 2: Overview of CEIA, which consists of a learnable event encoder, a frozen image encoder, and a frozen text encoder. We initialize the event encoder with CLIP's image encoder and finetune it using the LoRA technique [hu2021lora]. We align the event embedding space and image embedding space through contrastive learning. To highlight the versatility of CEIA, we evaluate it on four applications: object recognition, event-image retrieval, event-text retrieval, and domain adaptation.
  • Figure 3: Qualitative results of event-image retrieval and event-text retrieval.
  • Figure 4: Comparison of training CEIA with data at different scales.
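
Once event embeddings share CLIP's embedding space, the applications listed for Figure 2 reduce to nearest-neighbor search by cosine similarity. The sketch below illustrates zero-shot recognition and retrieval under that assumption. The helper names and the toy three-dimensional vectors are hypothetical; in practice the embeddings would come from the event encoder and from CLIP's frozen image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def zero_shot_classify(event_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the event embedding.

    text_embeddings maps a class label to the embedding of a text prompt
    for that class (hypothetical inputs for this sketch).
    """
    return max(
        text_embeddings,
        key=lambda label: cosine(event_embedding, text_embeddings[label]),
    )


def retrieve_top_k(query_embedding, gallery, k=1):
    """Return indices of the k gallery embeddings most similar to the query."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query_embedding, gallery[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy usage: an event embedding compared against two class prompts.
prompts = {"car": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
predicted = zero_shot_classify([0.9, 0.1, 0.0], prompts)
```

The same `retrieve_top_k` helper covers both event-image and event-text retrieval, since only the contents of the gallery change. R@1 then corresponds to checking whether the top-ranked index is the ground-truth match.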