Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation

Jinchang Zhang, Zijun Li, Jiakai Lin, Guoyu Lu

TL;DR

This paper tackles open-vocabulary object detection for event cameras by marrying adaptive event slicing with vision-language knowledge distillation. An SNN-based Adaptive Event Stream Slicing module dynamically segments sparse event data at an optimal time $n^*$ using Mem-Loss, LA-Loss, and SSF-Loss, producing discriminative ROI features. A CLIP-guided distillation pipeline transfers rich image-language semantics to the event detector, enabling text-based classification and cross-modal alignment via a spatial attention mechanism, while category-agnostic proposals improve generalization to unseen objects. Experiments on NCAR, Gen1, and DSEC datasets show strong base-category performance and notable open-vocabulary generalization, including zero-shot transfer across datasets, and ablations confirm the critical role of KD and adaptive slicing. The approach demonstrates that open-vocabulary detection is feasible directly from event streams, with a practical impact for low-latency, texture-free sensing in dynamic environments.

Abstract

Event cameras offer advantages in object detection tasks due to high-speed response, low latency, and robustness to motion blur. However, event cameras lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined categories, which limits their ability to generalize in settings where previously unseen objects are common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework that leverages CLIP's semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as inputs to a teacher model, guiding the event-based student model to learn CLIP's rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs while inheriting CLIP's broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid spiking neural network (SNN) and convolutional neural network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.
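To make the contrast between fixed-group and adaptive segmentation concrete, the following is a toy sketch only. It does not reproduce the paper's trained SNN or its Mem-Loss, LA-Loss, and SSF-Loss; it assumes a single hand-tuned leaky integrator whose "membrane potential" accumulates per-bin event counts and triggers a cut when it crosses a threshold. The function names and the `decay` and `threshold` parameters are hypothetical.

```python
import numpy as np

def fixed_count_slices(num_events, events_per_slice):
    """Baseline: cut the stream every `events_per_slice` events, ignoring content."""
    return [(i, min(i + events_per_slice, num_events))
            for i in range(0, num_events, events_per_slice)]

def lif_slices(event_counts, decay=0.9, threshold=10.0):
    """Toy adaptive slicing (assumption, not the paper's learned module).

    `event_counts[t]` is the number of events in time bin t. A leaky
    integrator accumulates the counts; when it reaches `threshold`, the
    current slice is closed at that bin and the potential is reset.
    Returns a list of half-open bin ranges (start, end).
    """
    slices = []
    potential = 0.0
    start = 0
    for t, count in enumerate(event_counts):
        potential = decay * potential + count  # leak, then integrate
        if potential >= threshold:             # "spike": close the slice here
            slices.append((start, t + 1))
            start = t + 1
            potential = 0.0
    if start < len(event_counts):              # keep any trailing bins
        slices.append((start, len(event_counts)))
    return slices

if __name__ == "__main__":
    counts = np.array([1, 1, 1, 20, 0, 0, 0, 0])
    print(fixed_count_slices(len(counts), 3))
    print(lif_slices(counts, decay=0.5, threshold=10.0))
```

In this sketch the fixed scheme cuts at regular positions regardless of activity, whereas the integrator cuts right after the burst of events. In the paper the slicing moment is learned rather than set by a hand-chosen threshold.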

Paper Structure

This paper contains 14 sections, 10 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The overview of our framework. The event stream is first fed into a spiking neural network, where Self-Supervised Feedback Loss is utilized to dynamically adjust the membrane potential based on object detection results, enabling adaptive event segmentation and feature extraction. We transfer image knowledge from CLIP to event data, using the CLIP image encoder as a teacher model. Through knowledge distillation, the student detector trained on event data learns the rich visual representations from CLIP. Additionally, category text is input into the frozen CLIP text encoder to generate text embeddings, and the cosine similarity between each region embedding and all category text embeddings is computed for object classification. During the inference phase, the model performs open-vocabulary object detection using only event stream data, without relying on image frames.
  • Figure 2: Open-vocabulary object detection results on the DSEC dataset [gehrig2021dsec]. From left to right: event frame, ViLD [gu2021open], RegionCLIP [zhong2022regionclip], YOLO-World [cheng2024yolo], and ours.
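
The classification step described in the Figure 1 caption (cosine similarity between each region embedding and all category text embeddings) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the embeddings are stand-in arrays rather than real CLIP outputs, and the function name `classify_regions` and the softmax `temperature` are assumptions introduced here.

```python
import numpy as np

def classify_regions(region_emb, text_emb, temperature=0.01):
    """Assign each region to the category with the most similar text embedding.

    region_emb: (R, D) array, one embedding per proposed region.
    text_emb:   (C, D) array, one embedding per category prompt.
    Returns (similarity, labels, probs) with shapes (R, C), (R,), (R, C).
    `temperature` is a hypothetical scaling factor, not a value from the paper.
    """
    region_emb = np.asarray(region_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    # L2-normalise so that the dot product equals cosine similarity.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    similarity = r @ t.T
    # Numerically stable softmax over categories.
    logits = similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return similarity, probs.argmax(axis=1), probs

if __name__ == "__main__":
    # Stand-in embeddings: three categories along orthogonal axes.
    text = np.eye(3)
    regions = np.array([[5.0, 0.1, 0.0], [0.0, 0.2, 3.0]])
    sim, labels, probs = classify_regions(regions, text)
    print(labels)
```

Because the category set enters only through `text_emb`, new categories can be added at inference time by supplying additional text embeddings, which is what makes this form of classification open-vocabulary.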