OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies

Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R. Cottereau, Wei Tsang Ooi

TL;DR

OpenESS addresses the challenge of scalable, open-world semantic understanding for event camera data by distilling CLIP knowledge into sparse event streams. It jointly optimizes frame-to-event contrastive distillation ($L_{F2E}$) and text-to-event consistency regularization ($L_{T2E}$) to align event representations with image and text semantics, using flexible event representations (voxel grids, reconstructions, or spikes). The approach achieves state-of-the-art results on DDD17-Seg and DSEC-Semantic under annotation-free and annotation-efficient settings and enables open-vocabulary predictions beyond fixed label sets. This work reduces annotation burden while enabling robust, real-time dense scene understanding in dynamic environments through cross-modality knowledge transfer.

Abstract

Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing. The difficulties in interpreting and annotating event data limit its scalability. While domain adaptation from images to event data can help to mitigate this issue, there exist data representational differences that require additional effort to resolve. In this work, for the first time, we synergize information from image, text, and event-data domains and introduce OpenESS to enable scalable ESS in an open-world, annotation-efficient manner. We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams. To pursue better cross-modality adaptation, we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization. Experimental results on popular ESS benchmarks show that our approach outperforms existing methods. Notably, we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic, respectively, without using either event or frame labels.
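
The abstract names a frame-to-event contrastive distillation without giving its form. As a rough illustration only, the sketch below assumes a generic InfoNCE-style objective over superpixel-averaged features, where the event feature of superpixel i is pulled toward the frame feature of the same superpixel and pushed away from the others. The function names, the temperature of 0.07, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; leave all-zero vectors unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def superpixel_pool(features, superpixel_ids, num_superpixels):
    # Average the per-pixel feature vectors that fall in each superpixel.
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_superpixels)]
    counts = [0] * num_superpixels
    for feat, sp in zip(features, superpixel_ids):
        counts[sp] += 1
        for d in range(dim):
            sums[sp][d] += feat[d]
    return [[x / max(c, 1) for x in row] for row, c in zip(sums, counts)]

def f2e_contrastive_loss(event_sp, frame_sp, temperature=0.07):
    # Hypothetical InfoNCE loss: event superpixel i is the positive for
    # frame superpixel i; all other frame superpixels act as negatives.
    e = [l2_normalize(v) for v in event_sp]
    f = [l2_normalize(v) for v in frame_sp]
    total = 0.0
    for i in range(len(e)):
        logits = [dot(e[i], f[j]) / temperature for j in range(len(f))]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(e)

# Toy data: three pixels, two superpixels, two feature channels.
pixel_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
sp_ids = [0, 0, 1]
event_sp = superpixel_pool(pixel_feats, sp_ids, 2)
aligned = f2e_contrastive_loss(event_sp, [[1.0, 0.0], [0.0, 1.0]])
swapped = f2e_contrastive_loss(event_sp, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy setup the loss is small when each event superpixel matches its own frame superpixel and large when the pairing is swapped, which is the behaviour a contrastive distillation objective of this kind is meant to reward.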

Paper Structure

This paper contains 32 sections, 7 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Open-vocabulary event-based semantic segmentation (OpenESS). Our framework is capable of performing zero-shot semantic segmentation of event data streams with open vocabularies. Given raw events and text prompts as inputs, OpenESS outputs semantically coherent open-world predictions across various adjective, fine-grained, and coarse categories. The last three columns show the language-guided attention maps where regions of a high similarity score to the given text prompts are highlighted. Best viewed in colors.
  • Figure 2: Architecture overview of the OpenESS framework. We distill off-the-shelf knowledge from vision-language models to event representations (cf. \ref{sec:revisit}). Given a calibrated event $I^{evt}$ and a frame $I^{img}$, we extract their features from the event network $\mathcal{F}^{evt}_{\theta_{e}}$ and the densified CLIP image encoder $\mathcal{F}^{clip}_{\theta_{c}}$, which are then combined with the text embedding from CLIP's text encoder $\mathcal{F}^{txt}_{\theta_{t}}$ for open-world prediction (cf. \ref{sec:ov-ess}). To better serve cross-modality knowledge transfer, we propose a frame-to-event (F2E) contrastive objective (cf. \ref{sec:f2e}) via superpixel-driven distillation and a text-to-event (T2E) consistency objective (cf. \ref{sec:t2e}) via scene-level regularization.
  • Figure 3: Ablation study on the number of superpixels (provided by either SAM \cite{kirillov2023segment} or SLIC \cite{achanta2012slic}) involved in calculating the frame-to-event contrastive loss. Models after pre-training are fine-tuned with 1% annotations. All mIoU scores are in percentage ($\%$).
  • Figure 4: Qualitative comparisons of state-of-the-art ESS approaches on the test set of DSEC-Semantic \cite{sun2022ess}. Each color corresponds to a distinct semantic category. GT denotes the ground truth semantic maps. Best viewed in colors and zoomed-in for additional details.
  • Figure 5: Cross-dataset representation learning results comparing OpenESS pre-training with in-distribution (ID) and out-of-distribution (OOD) data between the DDD17-Seg \cite{binas2017ddd17} and DSEC-Semantic \cite{sun2022ess} datasets. Models after pre-training are fine-tuned with 1%, 5%, 10%, and 20% annotations, respectively.
  • ...and 7 more figures
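
The captions of Figures 1 and 2 describe open-world prediction as combining event features with CLIP text embeddings and highlighting regions with a high similarity score to a prompt. One plausible reading, sketched here under stated assumptions, is a per-pixel cosine similarity against each text embedding followed by an argmax. The function `open_vocab_predict`, the two-dimensional vectors, and the class names are hypothetical stand-ins; real embeddings would come from the event network and CLIP's text encoder.

```python
import math

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def open_vocab_predict(pixel_features, text_embeddings, class_names):
    # For every pixel feature, score each text prompt and keep the best one.
    # The per-pixel score lists double as language-guided attention maps.
    labels, attention = [], []
    for feat in pixel_features:
        sims = [cosine(feat, t) for t in text_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(class_names[best])
        attention.append(sims)
    return labels, attention

# Toy inputs (made up for illustration, not real CLIP outputs).
class_names = ["road", "car"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
pixel_features = [[0.9, 0.1], [0.2, 0.8]]

labels, attention = open_vocab_predict(pixel_features, text_embeddings, class_names)
print(labels)
```

Because the label set enters only through `text_embeddings`, swapping in a different list of prompts changes the vocabulary without retraining, which is the property the open-vocabulary setting relies on.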