CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework

Wentao Wu, Xiao Wang, Chenglong Li, Bo Jiang, Jin Tang, Bin Luo, Qi Liu

TL;DR

CM3AE addresses the weak RGB-Event cross-modal alignment of existing event-based pre-training methods by introducing a dual-branch masked autoencoder augmented with a multimodal fusion reconstruction module and multimodal contrastive learning. It learns jointly from RGB frames, event frames, and event voxels, supporting both event-only and RGB-Event fusion downstream tasks. The large-scale REV2M dataset of 2,535,759 (about 2.54 million) RGB-Event pairs is used for pre-training, and experiments on five downstream tasks show improved performance over prior pre-trained models. The framework offers a scalable foundation for RGB-Event fusion with event cameras in real-world applications.

Abstract

Event cameras have attracted increasing attention in recent years due to their high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, which limits their applicability in multi-modal fusion scenarios. To address this issue, we propose CM3AE, a novel pre-training framework for RGB-Event perception. The framework accepts multiple modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-Event fusion based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate complementary cross-modal information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which enhances the model's capability for multi-modal understanding and for capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for pre-training. Extensive experiments on five downstream tasks demonstrate the effectiveness of CM3AE. Source code and pre-trained models will be released at https://github.com/Event-AHU/CM3AE.
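
As an illustration of what aligning cross-modal features in a shared latent space can look like, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss between paired RGB and event feature vectors. The abstract does not state the exact loss used in CM3AE, so the InfoNCE form, the temperature of 0.07, and the function names `info_nce` and `symmetric_info_nce` are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match positives[i].

    Every other row of `positives` serves as a negative for anchors[i].
    The temperature value is an assumed default, not taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i against every candidate, scaled.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the target class.
        total += log_sum_exp - logits[i]
    return total / len(a)


def symmetric_info_nce(rgb_feats, event_feats, temperature=0.07):
    """Average the RGB-to-event and event-to-RGB directions."""
    return 0.5 * (
        info_nce(rgb_feats, event_feats, temperature)
        + info_nce(event_feats, rgb_feats, temperature)
    )


# Toy data: 8 paired 16-dimensional features (hypothetical encoder outputs).
rng = random.Random(0)
rgb = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
event_aligned = [[x + 0.05 * rng.gauss(0, 1) for x in row] for row in rgb]
event_shuffled = event_aligned[1:] + event_aligned[:1]  # break the pairing

loss_aligned = symmetric_info_nce(rgb, event_aligned)
loss_mismatched = symmetric_info_nce(rgb, event_shuffled)
print(f"aligned pairs:    {loss_aligned:.4f}")
print(f"mismatched pairs: {loss_mismatched:.4f}")
```

The loss is low when each RGB feature is closest to its own event counterpart and high when the pairing is broken, which is the behavior a contrastive alignment objective is meant to reward. In a real pre-training setup the features would come from the two encoder branches and the loss would be computed with a tensor library rather than plain Python lists.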

Paper Structure

This paper contains 16 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Our proposed pre-trained CM3AE model takes RGB-Event Streams/-Frames as the input and supports unimodal and multimodal based downstream tasks, such as object detection and tracking, action recognition, etc.
  • Figure 2: An overview of our proposed pre-training framework for RGB-Event stream perception, termed CM3AE. Specifically, the framework supports multi-modalities/views of data as input, incorporating a multi-modal fusion generation module and a multi-modal contrastive learning strategy. These designs enhance the model's ability to aggregate cross-modal information and improve multi-modal understanding, boosting performance across various event-based single-modal tasks and RGB-Event fused multi-modal downstream tasks.
  • Figure 3: Representative samples in our pre-training dataset.
  • Figure 4: Visualization of the experimental results of each downstream task.
  • Figure 5: Visualization of attention maps on different downstream tasks.
  • ...and 1 more figure