IGL-DT: Iterative Global-Local Feature Learning with Dual-Teacher Semantic Segmentation Framework under Limited Annotation Scheme

Dinh Dai Quan Tran, Hoang-Thien Nguyen, Thanh-Huy Nguyen, Gia-Van To, Tien-Huy Nguyen, Quan Nguyen

TL;DR

IGL-DT addresses semi-supervised semantic segmentation with limited annotations by introducing a dual-teacher framework that fuses global context from SwinUnet with local detail from ResUnet, guided by a Discrepancy Learning mechanism to prevent over-reliance on a single teacher. The student learns through two complementary objectives, Global Context Learning and Local Regional Learning, under a two-stage process that uses Cross Pseudo Supervision and alternating unlabeled-data states. The losses are combined into the overall objective $\mathcal{L} = \mathcal{L}_{l} + \mathcal{L}_{u}$, with one term for labeled and one for unlabeled data. Empirical results on Pascal VOC 2012 and Cityscapes demonstrate state-of-the-art performance across multiple label regimes and are supported by ablations confirming the benefits of combining global/local cues and the discrepancy term. The approach highlights the value of heterogeneous backbones in semi-supervised segmentation and offers a scalable path to robust performance when annotations are scarce.

Abstract

Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. To address this challenge, we propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT. Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over-reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior segmentation performance across various data regimes.

Paper Structure

This paper contains 16 sections, 14 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Demonstration of our proposed IGL-DT framework, which consists of two training stages: the Teacher Pretraining Stage (light red color) and the Student Learning Stage (light yellow color). In the Teacher Pretraining Stage, the dual-teacher model, initialized with two distinct backbones, is trained using the Cross Pseudo Supervision (CPS) approach to improve weight initialization for generating more reliable pseudo-labels. In the Student Learning Stage, the student model learns from labeled data using both pseudo-labels from the teacher models and ground truth labels, optimized through $\mathcal{L}_{sup}$ and $\mathcal{L}_P$. For unlabeled data, the student leverages Discrepancy Learning, Pseudo-Labeling Loss, and an iterative swapping mechanism between Global Context Learning and Local Regional Learning, guided by both teacher models.
  • Figure 2: Illustration of the Global Context Learning process in the proposed dual-teacher framework. The teacher and student models extract high-level feature representations, which are processed using Global Average Pooling (GAP) to obtain compact embeddings. The Global Loss ($\mathcal{L}_{Glo}$) enforces consistency between the teacher and student representations, guiding the student to capture global contextual information effectively.
  • Figure 3: Illustration of the Local Regional Learning process. The teacher and student extract feature maps, which are then flattened into spatial feature representations. The local feature similarities are computed and aligned using the Local Loss ($\mathcal{L}_{Loc}$), encouraging the student to preserve fine-grained spatial details learned by the teacher.
  • Figure 4: Qualitative comparison of semantic segmentation results on the Cityscapes dataset. The first and second columns show the input images and ground truth annotations. The subsequent columns illustrate different methods' predictions. Our method demonstrates superior performance in preserving fine-grained details and accurately segmenting small or occluded objects (highlighted in yellow boxes).
  • Figure 5: Ablation Studies on Backbone Selection
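
The captions for Figures 2 and 3 describe the two feature-alignment objectives only in words. The sketch below shows one way they could look in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the exact forms of $\mathcal{L}_{Glo}$ and $\mathcal{L}_{Loc}$ are not given in this summary, so mean squared error is assumed for both, cosine similarity is assumed for the local feature similarities, and the function names are hypothetical.

```python
import numpy as np


def global_context_loss(teacher_feat, student_feat):
    """Assumed form of L_Glo: MSE between globally average-pooled embeddings.

    Both inputs have shape (batch, channels, height, width).
    """
    # Global Average Pooling: collapse the spatial axes into one vector per image.
    t_emb = teacher_feat.mean(axis=(2, 3))
    s_emb = student_feat.mean(axis=(2, 3))
    return float(np.mean((t_emb - s_emb) ** 2))


def pairwise_cosine(feat, eps=1e-8):
    """Flatten (B, C, H, W) to (B, H*W, C) and return (B, H*W, H*W) cosine similarities."""
    b, c, h, w = feat.shape
    flat = feat.reshape(b, c, h * w).transpose(0, 2, 1)
    # Normalise each spatial feature vector so the dot product is a cosine similarity.
    flat = flat / (np.linalg.norm(flat, axis=2, keepdims=True) + eps)
    return flat @ flat.transpose(0, 2, 1)


def local_regional_loss(teacher_feat, student_feat):
    """Assumed form of L_Loc: MSE between teacher and student similarity maps."""
    diff = pairwise_cosine(teacher_feat) - pairwise_cosine(student_feat)
    return float(np.mean(diff ** 2))


# Toy feature maps: the student is a slightly perturbed copy of the teacher.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 8, 4, 4))
student = teacher + 0.1 * rng.normal(size=(2, 8, 4, 4))

print("global loss:", global_context_loss(teacher, student))
print("local loss:", local_regional_loss(teacher, student))
```

In this sketch the local loss compares similarity maps of shape (H*W, H*W), so teacher and student may have different channel widths as long as their spatial sizes match. The global loss as written requires equal channel counts; with heterogeneous backbones a projection layer would be needed, which is omitted here.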