SynSeg: Feature Synergy for Multi-Category Contrastive Learning in End-to-End Open-Vocabulary Semantic Segmentation

Weichen Zhang, Kebin Liu, Fan Dang, Zhui Zhu, Xikai Sun, Yunhao Liu

TL;DR

This work tackles open-vocabulary semantic segmentation (OVSS) under weak supervision by introducing SynSeg, which combines Multi-Category Contrastive Learning (MCCL) with a Feature Synergy Structure (FSS). MCCL provides a richer supervisory signal by enforcing intra- and inter-category alignment and separation across multiple categories within the same image, while FSS reconstructs discriminative category-aware features using semantic-activation maps and FiLM-guided fusion, avoiding reliance on repeated passes through large pretrained encoders. The method operates end-to-end without intermediate outputs from large models, enabling real-time inference while maintaining high localization and discrimination performance. Experiments on five OVSS benchmarks show state-of-the-art results, with accuracy gains of 6.9% to 26.2% over SOTA baselines and robust qualitative behavior across thresholds and scenes.

Abstract

Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and ill-suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly-supervised approach, SynSeg, to address these challenges. SynSeg uses Multi-Category Contrastive Learning (MCCL) as a stronger training signal, together with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, the MCCL strategy robustly combines intra- and inter-category alignment and separation so that the model learns the correlations among different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. Furthermore, SynSeg is a lightweight end-to-end solution that does not use any intermediate output from large-scale pretrained models and is capable of real-time inference. Overall, SynSeg efficiently improves semantic localization and discrimination under weak supervision. Extensive experiments on benchmarks demonstrate that our method outperforms state-of-the-art (SOTA) methods. In particular, SynSeg achieves accuracy gains over SOTA baselines ranging from 6.9% to 26.2%.
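
To make the idea of combining intra-category alignment with inter-category separation concrete, here is a minimal NumPy sketch. It is not the paper's loss: the abstract does not state the MCCL formulation, so this uses a generic InfoNCE-style cross-entropy as a stand-in. The function name, the one-feature-per-category input layout, and the temperature value are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def multi_category_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Illustrative InfoNCE-style loss over the K categories present in one image.

    region_feats: (K, D) array, one visual feature per category (assumed layout).
    text_feats:   (K, D) array, the text embedding of the matching category.
    The diagonal of the similarity matrix holds the matching pairs (pulled
    together); off-diagonal entries pair a category with the other categories
    of the same image (pushed apart).
    """
    v = l2_normalize(region_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature                      # (K, K) similarities
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Small demonstration with random embeddings (hypothetical data).
rng = np.random.default_rng(0)
demo_text = rng.normal(size=(3, 16))
demo_regions = demo_text + 0.01 * rng.normal(size=(3, 16))
print(multi_category_contrastive_loss(demo_regions, demo_text))
```

In this sketch, features that match their own category's text embedding yield a small loss, while features that resemble another category's embedding from the same image are penalized.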

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Training paradigms comparison among previous works and ours. Prior approaches typically adopt either (a) image-text alignment or (b) region-text/region-word alignment, primarily emphasizing intra-category contrastive learning. In contrast, our novel paradigm (c) explicitly incorporates inter-category contrastive learning for improved discriminative capability. Also, our approach does not need to reconstruct training features from a pre-trained visual encoder.
  • Figure 2: The pipeline of SynSeg. It illustrates the proposed Feature Synergy Structure and Multi-Category Contrastive Learning framework. During training, the FiLM fusion module, the transformer decoder, and the projector remain trainable, while the CLIP encoders stay frozen. The projector ensures that the feature vectors have an appropriate dimension for later use.
  • Figure 3: Zero-shot semantic segmentation comparisons among weakly-supervised OVSS methods on five representative datasets. Bold indicates best performance; underlined values are second-best. Results are in mIoU (%); higher is better.
  • Figure 4: Visual comparison of segmentation results. The light blue regions indicate the segmentation predictions. The baselines' results are compared with those of our method, SynSeg.
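
The FiLM fusion module named in the Figure 2 caption refers to feature-wise linear modulation, in which a conditioning vector produces a per-channel scale and shift that are applied to another feature map. The sketch below shows only that general operation in NumPy. How SynSeg derives the conditioning signal and where the module sits in the pipeline are not described in this summary, so the function name, argument shapes, and the single linear layers are assumptions.

```python
import numpy as np

def film_modulate(visual_feats, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM: out = gamma(cond) * visual_feats + beta(cond).

    visual_feats: (N, C) array of N tokens or pixels with C channels.
    cond:         (D,) conditioning vector (for example a text embedding).
    w_gamma, w_beta: (D, C) weights of two hypothetical linear layers.
    b_gamma, b_beta: (C,) biases of those layers.
    """
    gamma = cond @ w_gamma + b_gamma   # per-channel scale, shape (C,)
    beta = cond @ w_beta + b_beta      # per-channel shift, shape (C,)
    return gamma * visual_feats + beta # broadcasts over the N tokens

# Small demonstration with random values (hypothetical data).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 3))
condition = rng.normal(size=5)
fused = film_modulate(
    tokens, condition,
    rng.normal(size=(5, 3)), np.ones(3),
    rng.normal(size=(5, 3)), np.zeros(3),
)
print(fused.shape)
```

Because only the two small linear layers carry parameters here, a module of this kind can be trained while the encoders that produce its inputs stay frozen, which matches the trainable/frozen split stated in the caption.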