OVOSE: Open-Vocabulary Semantic Segmentation in Event-Based Cameras

Muhammad Rameez Ur Rahman; Jhony H. Giraldo; Indro Spinelli; Stéphane Lathuilière; Fabio Galasso

TL;DR

OVOSE addresses open-vocabulary semantic segmentation for event cameras, where labeled event data are scarce and existing methods are closed-set. It introduces a two-branch architecture with a grayscale-image branch and an event branch, both initialized from image foundation models, and uses synthetic data with knowledge distillation to transfer open-vocabulary capabilities to events. A dissimilarity network reweights the distillation losses to focus on well-reconstructed regions, and a mask generator plus a CLIP-style text encoder enable open-set class predictions. Evaluations on DDD17 and DSEC-Semantic show that OVOSE surpasses both closed-set event-segmentation baselines and image-based open-vocabulary models adapted to events, achieving leading mIoU and accuracy. The work demonstrates practical potential for real-world open-vocabulary segmentation in event-based perception.
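
The reweighting idea can be illustrated with a minimal sketch. It assumes the dissimilarity network outputs a per-pixel error in [0, 1] and that the weight is simply one minus that error; both are illustrative assumptions, not the paper's exact formulation. The function name `reweighted_distillation_loss` and its arguments are hypothetical.

```python
def reweighted_distillation_loss(teacher, student, error_map):
    """Weighted mean squared difference between teacher and student maps.

    teacher, student, error_map: equally sized 2D lists of floats.
    Assumption: error_map values lie in [0, 1], and the per-pixel weight is
    1 - error, so poorly reconstructed pixels contribute less to the loss.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for t_row, s_row, e_row in zip(teacher, student, error_map):
        for t, s, e in zip(t_row, s_row, e_row):
            w = 1.0 - e
            weighted_sum += w * (t - s) ** 2
            weight_total += w
    # Guard against a map in which every pixel is marked as fully unreliable.
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

With an all-zero error map this reduces to a plain mean squared error; pixels with error 1 are ignored entirely.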

Abstract

Event cameras, known for low-latency operation and superior performance in challenging lighting conditions, are suitable for sensitive computer vision tasks such as semantic segmentation in autonomous driving. However, challenges arise due to limited event-based data and the absence of large-scale segmentation benchmarks. Current works are confined to closed-set semantic segmentation, limiting their adaptability to other applications. In this paper, we introduce OVOSE, the first Open-Vocabulary Semantic Segmentation algorithm for Event cameras. OVOSE leverages synthetic event data and knowledge distillation from a pre-trained image-based foundation model to an event-based counterpart, effectively preserving spatial context and transferring open-vocabulary semantic segmentation capabilities. We evaluate OVOSE on two driving semantic segmentation datasets, DDD17 and DSEC-Semantic, comparing it with existing image-based open-vocabulary models adapted for event-based data. We also compare OVOSE with state-of-the-art closed-set methods for unsupervised domain adaptation in event-based semantic segmentation. OVOSE demonstrates superior performance, showcasing its potential for real-world applications. The code is available at https://github.com/ram95d/OVOSE.

Paper Structure (15 sections, 6 equations, 7 figures, 4 tables)

Introduction
Related Work
Knowledge Distillation
Open-vocabulary Segmentation in Events
Preliminaries
Overview of OVOSE
Distilling Image Embeddings
Feature Distillation
Mask Re-weighting
Category Label Supervision
Experiments and Results
Experimental Framework
Results
Ablation Studies
Conclusion

Figures (7)

Figure 1: Output of a regular RGB foundation model for semantic segmentation and of OVOSE on event-based data. OVOSE accurately segments the person, trees, and sky.
Figure 2: Overview of the OVOSE pipeline. Our algorithm comprises two branches: the original grayscale image branch and the event-based branch. First, events are converted into a grayscale image using the E2VID model. Next, both the original and reconstructed grayscale images are converted into text embeddings by an image encoder and an MLP. Features are then extracted from a frozen text-to-image diffusion UNet for each pair of image and text embedding. For each branch, a mask generator predicts class-agnostic binary masks and the associated mask embedding features. Categorization is performed through a dot product between the mask embedding features and the text embeddings. Both branches are initialized with ODISE weights (Xu et al., 2023), and knowledge is distilled from the original image branch to the event-based branch during training. The original and reconstructed images are fed into a dissimilarity network that weights the distillation of the outputs. During evaluation, only the event-based branch is used.
Figure 3: The dissimilarity network takes the grayscale and reconstructed images as input and outputs an error map used to reweight the mask loss. E2VID is unable to reconstruct the stripes, so the dissimilarity network marks that region as a high-error area.
Figure 4: The effect of reweighting the mask loss according to the dissimilarity between the grayscale and reconstructed images. Poorly reconstructed areas, such as the person and the elephant's trunk, are excluded by the reweighting.
Figure 5: Qualitative samples from ESS in the UDA closed-set setting, and from E2VID+ODISE and OVOSE in the open-vocabulary setting. Compared with ESS and E2VID+ODISE, OVOSE produces accurate and less noisy predictions even though it is trained on a synthetic dataset.
...and 2 more figures
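
The categorization step described in the Figure 2 caption (a dot product between mask embeddings and text embeddings) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `classify_masks`, the toy embeddings, and the L2 normalization before the dot product (a common CLIP-style choice) are all assumptions.

```python
import math


def classify_masks(mask_embeddings, text_embeddings, class_names):
    """Label each mask with the class whose text embedding scores highest.

    mask_embeddings: list of vectors, one per predicted mask.
    text_embeddings: list of vectors, one per entry of class_names.
    Assumption: vectors are L2-normalized first, so the dot product
    behaves like a cosine similarity.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    text_unit = [normalize(t) for t in text_embeddings]
    labels = []
    for mask_vec in mask_embeddings:
        mask_unit = normalize(mask_vec)
        # One similarity score per candidate class name.
        scores = [sum(a * b for a, b in zip(mask_unit, t)) for t in text_unit]
        labels.append(class_names[scores.index(max(scores))])
    return labels


# Toy example with hypothetical 3-dimensional embeddings.
names = ["person", "tree", "sky"]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
masks = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
predicted = classify_masks(masks, text, names)
```

Because the class list is supplied as text at inference time, changing `names` and the corresponding text embeddings changes the label set without retraining, which is what makes the prediction open-vocabulary.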
