Collaborative Learning for Enhanced Unsupervised Domain Adaptation

Minhee Cho, Hyesong Choi, Hayeon Jo, Dongbo Min

TL;DR

Unsupervised Domain Adaptation (UDA) for lightweight models suffers from domain-shift-induced non-salient parameters (DSN) when Knowledge Distillation relies on a fixed, domain-adapted teacher. The authors propose Collaborative Learning for UDA (CLDA), which alternates between updating the teacher through layer-wise relations with a compact student and distilling the refined teacher's knowledge back to the student, thereby improving both models. They introduce the Layer Saliency Rate (LSR) to quantify per-layer saliency under domain shift and report consistent gains on semantic segmentation and image classification benchmarks, with modest training overhead and no extra inference cost. The approach offers a practical path to deploying efficient UDA systems in resource-constrained settings and suggests avenues for extending collaboration to broader domain-generalization problems.

Abstract

Unsupervised Domain Adaptation (UDA) endeavors to bridge the gap between a model trained on a labeled source domain and its deployment in an unlabeled target domain. However, current high-performance models demand significant resources, making deployment costs prohibitive and highlighting the need for compact yet effective models. For UDA of lightweight models, Knowledge Distillation (KD) with a Teacher-Student framework is a natural approach, but we found that domain shift in UDA leads to a significant increase in non-salient parameters in the teacher model, degrading the model's generalization ability and transferring misleading information to the student model. Interestingly, we observed that this phenomenon occurs considerably less in the student model. Driven by this insight, we introduce Collaborative Learning for UDA (CLDA), a method that updates the teacher's non-salient parameters using the student model and at the same time utilizes the updated teacher model to improve the UDA performance of the student model. Experiments show consistent performance improvements for both student and teacher models. For example, in semantic segmentation, CLDA achieves an improvement of +0.7% mIoU for the teacher model and +1.4% mIoU for the student model over the baseline model on GTA-to-Cityscapes. On Synthia-to-Cityscapes, it achieves an improvement of +0.8% mIoU and +2.0% mIoU for the teacher and student models, respectively.
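
The segmentation gains above are reported in mean Intersection-over-Union (mIoU). As a reminder of what that metric measures, the following is a minimal sketch and not the authors' evaluation code. The function name `mean_iou`, the flat label lists, and the ignore label 255 are assumptions made for illustration; benchmark scores are normally accumulated over the whole validation set rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in `pred` or `target`.

    `pred` and `target` are flat sequences of integer class ids of equal
    length. Pixels whose ground-truth label equals `ignore_index` are skipped
    (255 is an assumed convention, not taken from the paper).
    """
    inter = [0] * num_classes  # |prediction AND ground truth| per class
    union = [0] * num_classes  # |prediction OR ground truth| per class
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes and one ignored pixel.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/2, 2/3, and 1, so `score` is their average, 13/18. An improvement of "+1.4% mIoU" refers to a difference of 1.4 points in this average when expressed as a percentage.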

Paper Structure (31 sections, 8 equations, 5 figures, 7 tables)

This paper contains 31 sections, 8 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Conceptual Comparison of UDA Approaches for a Lightweight Model. While the existing KD method for UDA (Kothandaraman et al., 2021) uses a fixed teacher, as shown in (a), our approach updates the teacher through collaborative learning with the student while allowing the student to fully exploit the enhanced knowledge of the teacher.
  • Figure 2: The distribution of LSR at the layer level. We visualize the distribution of salient and non-salient layers in the fixed teacher model (T), distilled student model (S), and independently trained student model (IS) by measuring the LSR across various UDA methods. Here, the teacher model is a domain-adapted model larger than the student, not a Mean Teacher of the same size. We evaluated DAFormer (Hoyer et al., 2022) and HRDA (Hoyer et al., 2022) in domain adaptation scenarios where the source domains are GTA and Synthia, and the target domain is Cityscapes. While more than 50% of the teacher model's layers suffer from the DSN issue, this problem is significantly less prevalent in the distilled student model.
  • Figure 3: CKA Heatmap between Teacher and Distilled Student. We compute a CKA heatmap between modules within the teacher and distilled student models. The lower half of the student model functionally corresponds to twice the number of modules in the teacher model. Notably, the upper half of the student model aligns with 2.5 times the number of modules in the teacher model.
  • Figure 4: An illustration of the proposed CLDA framework. For the teacher model, we first identify DSN layers and establish layer-wise relations with the student model. Based on the layer-wise relations, we update non-salient layers to mitigate their DSN problem. The student model then incorporates the refined representations from the updated teacher model, leveraging enhanced generalization to improve performance in the target domain.
  • Figure 5: Comparison of Student Model Size in CLDA. 'M1' indicates baseline results where the teacher (MiT-B5) and student (MiT-B3) models are trained individually, while 'M2' and 'M3' represent the results obtained by applying CLDA. In 'M2', the student model is the same size as the teacher (MiT-B5). 'M3' indicates the original setup of CLDA, where a teacher model (MiT-B5) and a smaller student model (MiT-B3) are used. (a) Comparison of performance based on student model size. (b) Similarity between non-salient teacher layers and compact student layers, as well as the layer similarity within the student model of the same size.
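
The CKA heatmap in Figure 3 compares modules of the teacher with modules of the distilled student. The captions do not say which CKA variant or which activations were used, so the sketch below assumes linear CKA computed on per-module feature matrices (one row per input sample). The helper names `linear_cka` and `cka_heatmap` are hypothetical, and the pure-Python implementation is only meant to show the computation, not to be efficient.

```python
import math


def _center(X):
    """Subtract the per-feature mean from an n x d matrix given as a list of rows."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def _cross_sq_fro(X, Y):
    """Squared Frobenius norm of X^T Y for an n x d1 and an n x d2 matrix."""
    total = 0.0
    for xcol in zip(*X):
        for ycol in zip(*Y):
            dot = sum(a * b for a, b in zip(xcol, ycol))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F).

    X and Y hold features of the same n samples; their widths may differ.
    Undefined (division by zero) if either matrix has no variance.
    """
    Xc, Yc = _center(X), _center(Y)
    num = _cross_sq_fro(Xc, Yc)
    den = math.sqrt(_cross_sq_fro(Xc, Xc) * _cross_sq_fro(Yc, Yc))
    return num / den


def cka_heatmap(teacher_feats, student_feats):
    """Rows index teacher modules, columns index student modules."""
    return [[linear_cka(t, s) for s in student_feats] for t in teacher_feats]


# Toy features: 4 samples, 2 features per "module".
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
X_rot = [[-r[1], r[0]] for r in X]       # X rotated by 90 degrees
Z = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 2.0, 1.0], [1.0, 3.0, 3.0]]
heatmap = cka_heatmap([X, Z], [X_rot, Z])  # 2 x 2 grid of similarities
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so `linear_cka(X, X_rot)` equals 1 up to floating-point error. That invariance is what makes it reasonable to compare modules of different widths, such as layers of a larger teacher and a compact student.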