DynaGuide: A Generalizable Dynamic Guidance Framework for Unsupervised Semantic Segmentation

Boujemaa Guermazi, Riadh Ksantini, Naimul Khan

TL;DR

DynaGuide addresses unsupervised semantic segmentation by uniting global context from external priors with local CNN-based refinement in a fully unsupervised, end-to-end framework. It introduces a Dynamic Dual-Guidance loss that balances feature similarity, spatial continuity via Huber smoothing with diagonal terms, and alignment to global pseudo-labels, enabling precise boundaries without ground-truth labels. Across BSD500, PASCAL VOC2012, and COCO, and with both DiffSeg and SegFormer as guidance sources, DynaGuide achieves state-of-the-art mIoU while maintaining a lightweight footprint suitable for real-time or resource-constrained applications. The approach demonstrates strong generalization, modularity, and practical impact for real-world unsupervised segmentation tasks, with clear avenues for extending to video and domain adaptation.

Abstract

Unsupervised image segmentation is a critical task in computer vision. It enables dense scene understanding without human annotations, which is especially valuable in domains where labelled data is scarce. However, existing methods often struggle to reconcile global semantic structure with fine-grained boundary accuracy. This paper introduces DynaGuide, an adaptive segmentation framework that addresses these challenges through a novel dual-guidance strategy and dynamic loss optimization. Building on our previous work, DynaSeg, DynaGuide combines global pseudo-labels from zero-shot models such as DiffSeg or SegFormer with local boundary refinement using a lightweight CNN trained from scratch. This synergy allows the model to correct coarse or noisy global predictions and produce high-precision segmentations. At the heart of DynaGuide is a multi-component loss that dynamically balances feature similarity, Huber-smoothed spatial continuity, including diagonal relationships, and semantic alignment with the global pseudo-labels. Unlike prior approaches, DynaGuide trains entirely without ground-truth labels in the target domain and supports plug-and-play integration of diverse guidance sources. Extensive experiments on BSD500, PASCAL VOC2012, and COCO demonstrate that DynaGuide achieves state-of-the-art performance, improving mIoU by 17.5% on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. With its modular design, strong generalization, and minimal computational footprint, DynaGuide offers a scalable and practical solution for unsupervised segmentation in real-world settings. Code available at: https://github.com/RyersonMultimediaLab/DynaGuide
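
As a rough illustration of the Huber-smoothed spatial continuity term mentioned in the abstract, the NumPy sketch below penalizes differences between neighbouring pixels of a response map along the vertical, horizontal, and both diagonal directions. The function names, the `delta` threshold, and the equal weighting of the four directions are assumptions made for this example; the paper's exact formulation may differ.

```python
import numpy as np

def huber(diff, delta=1.0):
    """Elementwise Huber penalty: quadratic near zero, linear in the tails."""
    abs_diff = np.abs(diff)
    quadratic = 0.5 * abs_diff ** 2
    linear = delta * (abs_diff - 0.5 * delta)
    return np.where(abs_diff <= delta, quadratic, linear)

def spatial_continuity_loss(response, delta=1.0):
    """Sum of mean Huber penalties over four neighbour directions.

    response: array of shape (H, W, C), the per-pixel response map.
    The four directions are weighted equally (an assumption of this sketch).
    """
    pairs = [
        response[1:, :, :] - response[:-1, :, :],     # vertical neighbours
        response[:, 1:, :] - response[:, :-1, :],     # horizontal neighbours
        response[1:, 1:, :] - response[:-1, :-1, :],  # main-diagonal neighbours
        response[1:, :-1, :] - response[:-1, 1:, :],  # anti-diagonal neighbours
    ]
    return sum(huber(d, delta).mean() for d in pairs)
```

A constant response map yields zero loss, while a sharp edge contributes a penalty that grows only linearly once the difference exceeds `delta`, which is why the Huber form is less punishing on genuine object boundaries than a squared penalty.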

Paper Structure

This paper contains 31 sections, 5 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the DynaGuide segmentation approach, showcasing the Original Image (I), Ground Truth (GT) Mask, and Predicted (Pr) outputs. The results highlight the model’s ability to segment "Things" (Th) and "Stuff" (St) categories, emphasizing its robustness in handling complex visual scenes.
  • Figure 2: Architecture of DynaGuide for unsupervised image segmentation. The model uses a CNN Feature Extractor with convolutional layers, ReLU activation, and Batch Normalization (denoted by Mean ($\mu$) and Std Dev ($\sigma$) boxes) to generate a $p$-dimensional feature map. A linear classifier followed by normalization produces a Normalized Response Map, which is clustered to generate the final segmentation. The Global Pseudo-Label Generator provides external pseudo-labels from a global segmentation model (e.g., DiffSeg or SegFormer) to guide the segmentation process. The Adaptive Multi-Component Loss combines the Feature Similarity Loss ($L_{\text{sim}}$), the Spatial Continuity Loss ($L_{\text{con}}$), which uses Huber loss with diagonal components, and the Global Pseudo-Label Guidance Loss ($L_{\text{GP}}$) for iterative refinement.
  • Figure 3: Qualitative comparison of segmentation outputs across different methods: Original Image, Differentiable Clustering [li2024differentiable], DynaSeg [guermazi2024dynaseg], DynaGuide, and Ground Truth. Detailed examples highlight DynaGuide's ability to address varying brightness, object color differences, and complex shadows.
  • Figure 4: Qualitative comparisons of segmentation outputs for DynaGuide, SegFormer, and ground truth. The first row showcases an airplane scene, and the second row highlights wooden utensils. While SegFormer provides useful attention maps, DynaGuide effectively refines these predictions for enhanced segmentation accuracy.
  • Figure 5: Qualitative comparisons of segmentation outputs using DynaGuide with DiffSeg and SegFormer pseudo-labels. While both approaches perform well, limitations are evident in certain scenarios. DiffSeg often introduces over-segmentation in complex scenes, while SegFormer may oversimplify details. The figure shows results on airplane formations, airshows, buildings, and landscapes, demonstrating the strengths and limitations of both guidance methods.
  • ...and 1 more figure
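
The loss structure described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes that clustering is an argmax over the response map channels, that both $L_{\text{sim}}$ and $L_{\text{GP}}$ are softmax cross-entropy terms, and that the global pseudo-label ids fall in the range of the response channels. The names `dual_guidance_loss`, `w_con`, and `w_gp` are hypothetical, and the fixed weights stand in for the dynamic balancing, whose schedule is not given in this section. The continuity term is passed in as a precomputed scalar `l_con`.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy. logits: (N, C); labels: (N,) integer ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def dual_guidance_loss(response, global_labels, l_con, w_con=1.0, w_gp=1.0):
    """Combine the three loss terms for one image.

    response:      (H, W, C) normalized response map from the classifier head.
    global_labels: (H, W) integer pseudo-labels from the external model.
    l_con:         precomputed spatial continuity loss (scalar).
    w_con, w_gp:   hypothetical fixed weights (the paper balances dynamically).
    Returns the total loss and the (H, W) argmax segmentation.
    """
    h, w, c = response.shape
    logits = response.reshape(-1, c)
    cluster_labels = logits.argmax(axis=1)                   # self-generated targets
    l_sim = cross_entropy(logits, cluster_labels)            # feature similarity
    l_gp = cross_entropy(logits, global_labels.reshape(-1))  # global guidance
    total = l_sim + w_con * l_con + w_gp * l_gp
    return total, cluster_labels.reshape(h, w)
```

When the external pseudo-labels agree with the network's own clusters, the guidance term adds little; where they disagree, it pulls the response map toward the global prediction, while the similarity and continuity terms keep local structure coherent.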