Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras

Hoonhee Cho, Sung-Hoon Yoon, Hyeokjun Kweon, Kuk-Jin Yoon

TL;DR

EV-WSSS addresses dense pixel-wise semantic segmentation for event cameras under sparse supervision. It introduces 1-class-1-click labels and applies asymmetric dual-student learning to the forward ($E^f$) and backward ($E^b$) event streams. It combines this with feature-level, prototype-based contrastive learning, employing intra-, inter-, and cross-branch aggregation via prototype distillation to sharpen semantic representations without dense ground truth (GT). The approach is evaluated on DDD17-Seg, DSEC-Semantic, and the newly released DSEC Night-Point, where it shows strong gains over baselines, robustness to incomplete or noisy annotations, and competitive performance in unsupervised domain adaptation (UDA) settings with weak target-domain labels. The work benefits event-based segmentation in challenging conditions (e.g., nighttime) and contributes a new dataset and publicly available code.
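
To make the forward and backward streams concrete, the following is a minimal sketch of one way a backward stream could be derived from a time-sorted event window. It rests on assumptions not stated in this summary: that reversal mirrors timestamps within the window and flips polarity. The `Event` type and `reverse_events` function are hypothetical names for illustration, and the paper's actual construction of $E^b$ (which uses a longer window) may differ.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """A single event; field names are illustrative, not from the paper."""
    x: int
    y: int
    t: float  # timestamp in seconds
    p: int    # polarity, +1 or -1


def reverse_events(events: List[Event]) -> List[Event]:
    """Replay a time-sorted event window backward.

    Assumption: a brightness increase seen forward in time becomes a
    decrease when played backward, so polarity is flipped while each
    timestamp is mirrored inside the window [t_start, t_end].
    """
    if not events:
        return []
    t_start, t_end = events[0].t, events[-1].t
    # Iterating in reverse keeps the output sorted by ascending time.
    return [
        Event(e.x, e.y, t_start + (t_end - e.t), -e.p)
        for e in reversed(events)
    ]


forward = [Event(0, 0, 0.0, 1), Event(1, 0, 0.25, -1), Event(2, 1, 1.0, 1)]
backward = reverse_events(forward)
```

Under these assumptions, reversing a window twice returns the original events, which is a convenient sanity check for any reversal routine.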

Abstract

Event cameras excel at capturing high-contrast scenes and dynamic objects, offering a significant advantage over traditional frame-based cameras. Despite active research into leveraging event cameras for semantic segmentation, generating pixel-wise dense semantic maps for such challenging scenarios remains labor-intensive. As a remedy, we present EV-WSSS: a novel weakly supervised approach for event-based semantic segmentation that utilizes sparse point annotations. To fully leverage the temporal characteristics of event data, the proposed framework performs asymmetric dual-student learning between 1) the original forward event data and 2) the longer reversed event data, which contain complementary information from the past and the future, respectively. In addition, to mitigate the challenges posed by sparse supervision, we propose feature-level contrastive learning based on class-wise prototypes, carefully aggregated at both spatial region and sample levels. We further exploit the potential of our dual-student learning model by exchanging prototypes between the two learning paths, thereby harnessing their complementary strengths. In extensive experiments on various datasets, including DSEC Night-Point with sparse point annotations newly provided by this paper, the proposed method achieves strong segmentation results even without relying on pixel-level dense ground truths. The code and dataset are available at https://github.com/Chohoonhee/EV-WSSS.
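
The class-wise prototype idea in the abstract can be illustrated with a minimal sketch. It assumes that a prototype is the mean feature vector over the clicked pixels of a class, and that the contrastive term is an InfoNCE-style loss over cosine similarities. The function names, the temperature value, and the toy features are all hypothetical, and the paper's region-level, sample-level, and cross-branch aggregation is more elaborate than what is shown here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int]
Vector = Sequence[float]


def class_prototypes(
    features: Dict[Pixel, Vector], clicks: Dict[Pixel, int]
) -> Dict[int, List[float]]:
    """Average the feature vectors at clicked pixels, one mean per class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for pixel, label in clicks.items():
        vec = features[pixel]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def prototype_contrastive_loss(
    feature: Vector,
    label: int,
    prototypes: Dict[int, List[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss: pull a feature toward its own class prototype
    and push it away from the prototypes of all other classes."""
    logits = {c: cosine(feature, p) / temperature for c, p in prototypes.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    log_denominator = peak + math.log(
        sum(math.exp(v - peak) for v in logits.values())
    )
    return log_denominator - logits[label]


# Toy example: three clicked pixels, two classes, 2-D features.
features = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 1): [3.0, 0.0]}
clicks = {(0, 0): 0, (0, 1): 1, (1, 1): 0}
prototypes = class_prototypes(features, clicks)
loss_correct = prototype_contrastive_loss([1.0, 0.0], 0, prototypes)
loss_wrong = prototype_contrastive_loss([1.0, 0.0], 1, prototypes)
```

In this toy setup the loss is small when a feature is paired with its own class and large otherwise. Scoring features from one branch against prototypes computed from the other branch would loosely mimic the prototype exchange between the two learning paths, though that correspondence is only an approximation of the paper's design.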

Paper Structure

This paper contains 22 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: (a) Motivation for event-based weakly supervised semantic segmentation. (b) Performance comparison between our baseline, the baseline with dual-student learning, and our final model on the DSEC-Semantic (Sun et al., 2022) and DSEC Night-Point datasets.
  • Figure 2: Overview of the proposed EV-WSSS framework. Details of the prototype-related components are omitted from this figure for clarity.
  • Figure 3: Visualization of the proposed prototype-based contrastive learning approaches based on the aggregations performed in three different levels.
  • Figure 4: (a) Comparison with various self-supervised approaches and (b) the simplified training pipeline for the respective approaches. For (IV), we provide the performance of ours without the prototype-related components for a fair comparison.
  • Figure 5: Qualitative ablation of the EV-WSSS framework: (a) visualized event data, (b) results of the baseline, (c) results of our final model, (d) segmentation GT, and (e) image.
  • ...and 2 more figures