Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

Zhenpeng Huang, Chao Li, Hao Chen, Yongjian Deng, Yifeng Geng, Limin Wang

TL;DR

This work tackles data efficiency in self-supervised learning for event cameras by introducing a voxel-based SSL framework that preserves temporal cues without RGB supervision. It decouples reconstruction into local spatio-temporal and global semantic branches (disentangled masked modeling) and introduces semantic-uniform masking to balance learning across regions. The approach delivers strong gains across object recognition, detection, segmentation, and action recognition with fewer parameters and lower FLOPs, demonstrating improved generalization and practicality for real-world event-camera applications. By preserving sparsity and motion cues inherent to event data, the method offers a practical, scalable path for pre-training compact event-based backbones.

Abstract

In this paper, we present a new data-efficient voxel-based self-supervised learning method for event cameras. Our pre-training overcomes the limitations of previous methods, which either sacrifice temporal information by converting event sequences into 2D images in order to reuse pre-trained image models, or directly employ paired image data for knowledge distillation to enhance the learning of event streams. To make our pre-training data-efficient, we first design a semantic-uniform masking method that addresses the learning imbalance arising under random masking, where different regions of non-uniform data vary in reconstruction difficulty. In addition, we ease the traditional hybrid masked modeling process by explicitly decomposing it into two branches, namely local spatio-temporal reconstruction and global semantic reconstruction, to encourage the encoder to capture local correlations and global semantics, respectively. This decomposition allows our self-supervised learning method to converge faster with minimal pre-training data. Compared to previous approaches, our self-supervised learning method does not rely on paired RGB images, yet enables simultaneous exploration of spatial and temporal cues at multiple scales. It exhibits excellent generalization performance and demonstrates significant improvements across various tasks with fewer parameters and lower computational costs.
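To make the masking imbalance concrete, the following minimal sketch contrasts global random masking with masking the same fraction inside every region. It is an illustration under stated assumptions, not the authors' implementation: regions are approximated here by coarse spatial grid cells, and the function names, the `cell` parameter, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def global_random_mask(voxels, mask_ratio, rng):
    """Mask a fixed fraction of all voxels, ignoring where they lie.

    Returns a list of booleans, True meaning the voxel stays visible.
    """
    n_masked = int(len(voxels) * mask_ratio)
    masked = set(rng.sample(range(len(voxels)), n_masked))
    return [i not in masked for i in range(len(voxels))]


def region_uniform_mask(voxels, mask_ratio, rng, cell=8):
    """Mask the same fraction of voxels inside every region.

    Assumption: a region is a cell x cell spatial grid cell. This is a
    stand-in for the paper's region sampling, which is not detailed here.
    """
    regions = defaultdict(list)
    for i, (x, y, t) in enumerate(voxels):
        regions[(x // cell, y // cell)].append(i)

    visible = [True] * len(voxels)
    for idxs in regions.values():
        n_masked = int(len(idxs) * mask_ratio)
        for i in rng.sample(idxs, n_masked):
            visible[i] = False
    return visible


# Toy data: a dense region (80 voxels) next to a sparse one (20 voxels).
dense = [(i % 8, (i // 8) % 8, i) for i in range(80)]   # grid cell (0, 0)
sparse = [(8 + i % 8, i // 8, i) for i in range(20)]    # grid cell (1, 0)
voxels = dense + sparse

rng = random.Random(0)
vis_uniform = region_uniform_mask(voxels, 0.75, rng)
vis_global = global_random_mask(voxels, 0.75, rng)

dense_visible = sum(vis_uniform[:80])    # every region keeps 25% of its voxels
sparse_visible = sum(vis_uniform[80:])
```

With per-region masking, both regions retain the same fraction of their voxels, so the sparse region cannot be wiped out by chance while the dense one stays easy to reconstruct. Under global random masking only the overall visible count is fixed, and the split between regions varies from draw to draw.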

Paper Structure (24 sections, 6 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Comparison to state-of-the-art methods on the N-Caltech101 dataset in terms of accuracy and complexity. The size of each circle is proportional to the FLOPs of the associated model.
  • Figure 2: Visualization. Masked voxels are dropped. (a) shows the raw voxel input. (b) shows the visible voxels after global random masking; dense regions are recovered more easily. (c) shows the visible voxels after semantic-uniform masking, which balances the learning difficulty across local semantic regions.
  • Figure 3: Overview of our pre-training framework. Data processing workflow (left): raw event data are voxelized and filtered; each uniformly sampled region is then randomly masked and fed into the encoder separately. (I) Local Feature Reconstruction Branch (upper right): masked voxel features are reconstructed within each local structure. (II) Global Semantic Reconstruction Branch (lower right): summary tokens are generated for each region by the encoder and mean-pooling, followed by global masked semantic prediction.
  • Figure 4: Different percentages of pre-training data.
  • Figure 5: Different pre-training epochs.
  • ...and 1 more figures
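
The "voxelizing and filtering" step named in the Figure 3 caption could look roughly like the sketch below. This is a hedged illustration only: the voxel size, the number of time bins, the `min_events` threshold, and the function name are assumptions, and the paper's actual voxel features and filtering rule are not given in this summary.

```python
from collections import Counter


def voxelize(events, voxel_xy=4, n_time_bins=4, min_events=2):
    """Bin (x, y, t, polarity) events into a sparse spatio-temporal voxel grid.

    Voxels holding fewer than `min_events` events are dropped as noise.
    Polarity is ignored in this sketch; only event counts are kept.
    All parameter values are hypothetical.
    """
    if not events:
        return {}

    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    span = (t1 - t0) or 1  # avoid division by zero for a single timestamp

    counts = Counter()
    for x, y, t, _polarity in events:
        # Normalize time to [0, 1], then clamp the last timestamp into the final bin.
        t_bin = min(int((t - t0) / span * n_time_bins), n_time_bins - 1)
        counts[(x // voxel_xy, y // voxel_xy, t_bin)] += 1

    return {key: c for key, c in counts.items() if c >= min_events}


# Three nearby events share one voxel; the isolated event is filtered out.
events = [(0, 0, 0.0, 1), (1, 1, 0.1, 1), (2, 3, 0.2, -1), (9, 9, 1.0, 1)]
grid = voxelize(events)
```

Keeping time as a separate voxel axis is what lets a voxel representation retain temporal cues that a single 2D event image would collapse.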