Weakly Supervised Semantic Segmentation for Driving Scenes

Dongseob Kim; Seungho Lee; Junsuk Choe; Hyunjung Shim

TL;DR

This work addresses the poor performance of weakly supervised semantic segmentation (WSSS) on driving scenes by using CLIP-based pseudo-masks and tailoring training to driving data. The authors propose two components: Global-Local View Training, which adds small-scale patches during training to better handle small objects, and Consistency-Aware Region Balancing (CARB), which reduces the effect of pseudo-mask noise by prioritizing reliable regions through adaptive loss weighting. The method achieves 51.8% mIoU on the Cityscapes test set and is also effective on CamVid and WildDash2, making it a strong WSSS baseline for driving scenes. Informed by an analysis of dataset characteristics, the framework offers a label-efficient route to semantic segmentation for driving applications.

Abstract

State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS) using image-level labels exhibit severe performance degradation on driving scene datasets such as Cityscapes. To address this challenge, we develop a new WSSS framework tailored to driving scene datasets. Based on extensive analysis of dataset characteristics, we employ Contrastive Language-Image Pre-training (CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key challenges: (1) pseudo-masks from CLIP poorly represent small object classes, and (2) these masks contain notable noise. We propose a solution for each issue. (1) We devise Global-Local View Training, which seamlessly incorporates small-scale patches during model training, thereby enhancing the model's capability to handle small yet critical objects in driving scenes (e.g., traffic lights). (2) We introduce Consistency-Aware Region Balancing (CARB), a novel technique that distinguishes reliable from noisy regions by evaluating the consistency between CLIP masks and segmentation predictions. It prioritizes reliable pixels over noisy pixels via adaptive loss weighting. Notably, the proposed method achieves 51.8% mIoU on the Cityscapes test dataset, showcasing its potential as a strong WSSS baseline on driving scene datasets. Experimental results on CamVid and WildDash2 demonstrate the effectiveness of our method across diverse datasets, including small-scale datasets and visually challenging conditions. The code is available at https://github.com/k0u-id/CARB.
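To make the "global view versus local view" idea in item (1) concrete, here is a minimal NumPy sketch of the two operations the figure captions mention: resizing by a ratio of 2, and generating masks on quarter-size crops that are then concatenated back together. It is an illustration under stated assumptions, not the authors' implementation. The helper names (`resize_nearest`, `quarter_crops`, `stitch_quarters`, `local_view_mask`) are hypothetical, and `toy_mask_fn` is a brightness threshold standing in for a CLIP-based pseudo-mask generator.

```python
import numpy as np

def resize_nearest(img, ratio):
    """Upsample an (H, W, ...) array by an integer ratio via nearest-neighbor repetition."""
    return np.repeat(np.repeat(img, ratio, axis=0), ratio, axis=1)

def quarter_crops(img):
    """Split an (H, W, ...) array into four quarter-size crops.

    Order: top-left, top-right, bottom-left, bottom-right.
    Assumes H and W are even; an odd trailing row/column would be dropped.
    """
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return [img[:h, :w], img[:h, w:2 * w], img[h:2 * h, :w], img[h:2 * h, w:2 * w]]

def stitch_quarters(masks):
    """Concatenate four quarter-size masks back into one full-size mask."""
    tl, tr, bl, br = masks
    top = np.concatenate([tl, tr], axis=1)
    bottom = np.concatenate([bl, br], axis=1)
    return np.concatenate([top, bottom], axis=0)

def toy_mask_fn(img):
    """Hypothetical stand-in for a CLIP-based pseudo-mask generator.

    It thresholds per-pixel brightness, so it is NOT view-dependent; a real
    generator would generally return different masks for different views.
    """
    return (img.mean(axis=-1) > 0.5).astype(np.int64)

def local_view_mask(img, mask_fn):
    """Run the mask generator on each quarter crop, then stitch the results."""
    return stitch_quarters([mask_fn(crop) for crop in quarter_crops(img)])

rng = np.random.default_rng(0)
image = rng.random((8, 12, 3))          # toy RGB image with values in [0, 1)

global_mask = toy_mask_fn(image)        # mask from the full (global) view
local_mask = local_view_mask(image, toy_mask_fn)  # mask from local crops
upsampled = resize_nearest(image, 2)    # "resize ratio 2" view of the image
```

Because the toy generator works pixel by pixel, the global and local masks coincide here. With a view-dependent generator, the local-view mask would be expected to differ, which is the effect the small-scale patches are meant to exploit.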

Paper Structure (25 sections, 7 equations, 7 figures, 4 tables)

Introduction
Related Work
Earlier Works in WSSS.
CLIP-based Segmentation.
Uncertainty Estimation.
Statistics of Datasets
Method
Global-local View Training
Consistency-aware Region Balancing
Overall Training.
Experiments
Experimental Setup
Dataset & Evaluation Metric.
Implementation Detail.
Ablation Study
...and 10 more sections

Figures (7)

Figure 1: Dataset statistics for Cityscapes, CamVid, MS COCO, and PASCAL VOC. (a) Number of images grouped by the number of classes present in a single image. (b) Histogram of the co-occurrence ratio between classes. (c) The number of positive and negative images for each class.
Figure 2: Overall framework of the proposed method. (Global-local View Training) CLIP produces different pseudo-masks for cropped and resized inputs. (CARB) The pseudo-mask is divided into consistent and inconsistent regions, and the high loss of inconsistent regions is suppressed via adaptive region balancing.
Figure 3: Pseudo-masks after resizing and cropping. (a) The original CLIP mask. (b) CLIP mask with resize ratio 2. (c) The concatenation of quarter-size cropped CLIP masks. (d) The mask after applying both operations. For visual clarity, the color of the motorcycle class is changed to cyan in this figure.
Figure 4: The characteristics of two different masks. (a) The mask from CLIP contains small and blob-like noisy regions. (b) The output mask from the segmentation network is more systematic. We identify reliable regions (c) based on prediction consistency between (a) and (b).
Figure 5: Changes in (a) loss and (b) area of consistent/inconsistent regions during training. Adaptive region balancing is applied from iteration 16K onward, which affects the training dynamics.
...and 2 more figures
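
The consistent/inconsistent split described in the captions for Figures 2 and 4 can be sketched in a few lines of NumPy. This is a rough illustration rather than the paper's formulation. The consistency test (segmentation argmax equals the CLIP label) follows the captions, but the weighting rule, which scales the inconsistent-region loss so it never exceeds the consistent-region loss, is an assumption chosen only to show what "suppressing the high loss of inconsistent regions" could look like. The function names are hypothetical.

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Per-pixel cross-entropy. logits: (C, H, W); labels: (H, W) ints in [0, C)."""
    shifted = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    h_idx, w_idx = np.indices(labels.shape)
    return -log_probs[labels, h_idx, w_idx]

def region_balanced_loss(logits, clip_mask, eps=1e-8):
    """Split pixels by agreement with the CLIP mask and down-weight disagreeing ones.

    Returns (total_loss, inconsistent_weight, consistent_region_mask).
    """
    ce = pixel_cross_entropy(logits, clip_mask)
    consistent = logits.argmax(axis=0) == clip_mask   # prediction agrees with CLIP label
    inconsistent = ~consistent
    loss_con = ce[consistent].mean() if consistent.any() else 0.0
    loss_inc = ce[inconsistent].mean() if inconsistent.any() else 0.0
    # ASSUMED weighting rule (not taken from the paper): cap the inconsistent
    # term at the magnitude of the consistent term.
    weight = min(1.0, loss_con / (loss_inc + eps))
    return loss_con + weight * loss_inc, weight, consistent

# Toy example: 3 classes on a 2x2 image; the network predicts class 0 everywhere.
logits = np.zeros((3, 2, 2))
logits[0] = 4.0
clip_mask = np.array([[0, 0], [0, 1]])   # one pixel disagrees with the prediction

total, weight, consistent = region_balanced_loss(logits, clip_mask)
unweighted = pixel_cross_entropy(logits, clip_mask)
```

In this toy case three pixels are consistent and one is not. The single disagreeing pixel has a much larger cross-entropy, so its weight falls well below 1 and it no longer dominates the total.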
