Scaling Dense Event-Stream Pretraining from Visual Foundation Models

Zhiwen Chen, Junhui Hou, Zhiyu Zhu, Jinjian Wu, Guangming Shi

TL;DR

A self-supervised pretraining method that distills visual foundation models (VFMs) to scale up event representations. It extends the alignment objective to semantic structures provided off-the-shelf by VFMs, which offer a broader receptive field and stronger supervision.

Abstract

Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach takes a great leap in downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. This breakthrough manifests in enhanced generalization, superior data efficiency and elevated transferability.
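The abstract names a structure-aware distillation loss but this page does not give its formula. The following is only a minimal sketch of one plausible reading, not the authors' loss: a patch-level cosine term between event and image features, plus a relational term that matches the pairwise similarity structure of the two feature sets. The function names, the `active` mask argument, and the `weight` parameter are all hypothetical, and the relational term is a swapped-in stand-in for the paper's actual objective.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def patch_loss(event_feats, image_feats, active):
    """Mean (1 - cosine) over patches whose mask entry is True.

    `active` is a hypothetical per-patch flag marking where events occurred.
    """
    terms = [1.0 - cosine(e, i)
             for e, i, m in zip(event_feats, image_feats, active) if m]
    return sum(terms) / len(terms) if terms else 0.0

def similarity_matrix(feats):
    """Pairwise cosine similarities between all patches of one modality."""
    return [[cosine(a, b) for b in feats] for a in feats]

def structure_loss(event_feats, image_feats):
    """Assumed relational term: mean squared gap between similarity matrices."""
    sim_event = similarity_matrix(event_feats)
    sim_image = similarity_matrix(image_feats)
    n = len(event_feats)
    gap = sum((sim_event[r][c] - sim_image[r][c]) ** 2
              for r in range(n) for c in range(n))
    return gap / (n * n)

def total_loss(event_feats, image_feats, active, weight=1.0):
    """Patch-level term plus the weighted (assumed) structure term."""
    return (patch_loss(event_feats, image_feats, active)
            + weight * structure_loss(event_feats, image_feats))

# Toy usage: three patches with 2-D features.
image_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
event_feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask = [True, True, True]
loss_matched = total_loss(image_feats, image_feats, mask)
loss_mismatched = total_loss(event_feats, image_feats, mask)
```

In this toy setup the loss is near zero when event features equal the image features and grows when the second patch is grouped with the wrong neighbor, which is the kind of grouping error a purely per-patch term does not penalize on its own.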

Paper Structure

This paper contains 30 sections, 5 equations, 12 figures, and 19 tables.

Figures (12)

  • Figure 1: ScaleEvent: building upon large-scale cross-modal knowledge distillation from visual foundation models, we present a novel pretraining method to scale up event representations. By anchoring dense cross-modal correspondences with a structure-aware loss, we obtain high-quality, fine-grained event representations that exhibit strong generalization across downstream dense perception tasks.
  • Figure 2: Illustration of event-image feature alignment across granularities. Patch-level alignment exacerbates cross-modal mismatches, superpixel grouping is ambiguous, while semantic structure grounds superior event–image correspondences.
  • Figure 3: Cosine similarity maps obtained with DINOv3 output features (anchored at the distinct white stars). The image features exhibit coherent grouping induced by a strong off-the-shelf semantic structure.
  • Figure 4: Comparison of dense event features under different distillation strategies. All features are produced by a DINOv3-ViT-B model. Left to right: as spatial resolution increases, event representations degrade to varying degrees. PCA maps become less localized, and similarity maps (anchored at the red dot) become noisier. Top to bottom: (a) image features; (b) patch-level distillation; (c) superpixel-level distillation; (d) patch-level distillation + event activation mask; (e) patch-level distillation + event activation mask + structure-aware regularization (our method). A more detailed feature analysis is provided in the supplementary materials.
  • Figure 5: Comparison of dense image features under different visual foundation models through a toy example.
  • ...and 7 more figures
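
The anchored cosine similarity maps mentioned in the captions of Figures 3 and 4 can be illustrated with a short sketch. It assumes dense features arrive as an H x W grid of vectors (for example, one vector per ViT patch) and compares every location with the feature at a chosen anchor. The helper names `l2_normalize` and `similarity_map` are hypothetical and are not taken from the paper's code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_map(feature_grid, anchor):
    """Cosine similarity of every grid location to the anchor location.

    feature_grid: H x W nested list of feature vectors.
    anchor: (row, col) index of the reference location.
    """
    unit = [[l2_normalize(f) for f in row] for row in feature_grid]
    anchor_row, anchor_col = anchor
    ref = unit[anchor_row][anchor_col]
    # Dot product of unit vectors equals cosine similarity.
    return [[sum(a * b for a, b in zip(ref, f)) for f in row] for row in unit]

# Toy usage: a 2 x 2 grid of 2-D features, anchored at the top-left cell.
grid = [
    [[1.0, 0.0], [2.0, 0.0]],
    [[0.0, 3.0], [1.0, 1.0]],
]
sim = similarity_map(grid, (0, 0))
```

Locations whose features point in the same direction as the anchor score close to 1 regardless of magnitude, so a coherent semantic region shows up as a contiguous high-similarity area, while noisy features produce a scattered map.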