Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation

Ba Hung Ngo, Nhat-Tuong Do-Tran, Tuan-Ngoc Nguyen, Hae-Gon Jeon, Tae Jong Choi

TL;DR

ECB presents a hybrid CNN-ViT approach for domain adaptation that explicitly discovers class-specific decision boundaries by maximizing the discrepancy between two classifiers on a fixed ViT encoder, then leverages a CNN encoder to cluster target features within those boundaries. A co-training loop exchanges pseudo-labels between the ViT and CNN branches to reduce cross-model discrepancies and improve target supervision. Experiments on Office-Home and DomainNet demonstrate state-of-the-art performance in both UDA and SSDA settings, with notable gains from the Finding-to-Conquering stage and the co-training strategy. The work highlights the complementary strengths of ViT's global representations and CNN's local clustering, offering a practical, testing-efficient framework for cross-domain visual recognition. Future work could explore dynamic thresholding to replace fixed pseudo-label confidence levels.

Abstract

Most domain adaptation (DA) methods are based on either convolutional neural networks (CNNs) or vision transformers (ViTs). They use these networks as encoders to align the distribution differences between domains, without considering the unique characteristics of each architecture. For instance, ViT excels in accuracy due to its superior ability to capture global representations, while CNN has an advantage in capturing local representations. This fact has led us to design a hybrid method, called Explicitly Class-specific Boundaries (ECB), that takes full advantage of both ViT and CNN. ECB learns CNN on ViT to combine their distinct strengths. In particular, we leverage ViT's properties to explicitly find class-specific decision boundaries by maximizing the discrepancy between the outputs of two classifiers, which detects target samples far from the source support. In contrast, the CNN encoder clusters target features based on the previously defined class-specific boundaries by minimizing the discrepancy between the probabilities of the two classifiers. Finally, ViT and CNN mutually exchange knowledge to improve the quality of pseudo labels and reduce the knowledge discrepancies between the two models. Compared to conventional DA methods, our ECB achieves superior performance, which verifies the effectiveness of this hybrid model. The project website can be found at https://dotrannhattuong.github.io/ECB/website.
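
The discrepancy between two classifiers that the abstract refers to can be illustrated with a small sketch. It assumes the discrepancy is the mean absolute difference between the two softmax outputs, a common choice in classifier-discrepancy methods; the abstract does not state the exact form, and the function names `softmax`, `discrepancy`, and `batch_discrepancy` are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discrepancy(logits_f1, logits_f2):
    """Mean absolute difference between the two classifiers' probabilities.

    Assumption: an L1-style discrepancy; the abstract only says
    "discrepancy between the outputs of the two classifiers".
    """
    p1 = softmax(logits_f1)
    p2 = softmax(logits_f2)
    return sum(abs(a - b) for a, b in zip(p1, p2)) / len(p1)


def batch_discrepancy(batch_f1, batch_f2):
    """Average the per-sample discrepancy over a batch of target samples."""
    per_sample = [discrepancy(a, b) for a, b in zip(batch_f1, batch_f2)]
    return sum(per_sample) / len(per_sample)


# Two classifiers that agree on a target sample give zero discrepancy,
# while classifiers that disagree give a large one. Such disagreement is
# the signal for a target sample far from the source support.
agree = discrepancy([5.0, 0.0, 0.0], [5.0, 0.0, 0.0])
disagree = discrepancy([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])
```

Under this reading, the ViT stage would update the two classifiers to increase `batch_discrepancy` on target samples, and the CNN stage would update the CNN encoder to decrease the same quantity, so that target features move inside the boundaries found by the ViT.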

Paper Structure (12 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Illustration of the proposed hybrid network architecture that leverages the strengths of ViT and CNN models.
  • Figure 2: Illustration of a hybrid network with the proposed Finding to Conquering (FTC) strategy. We use a ViT as the first encoder $E_{1}$, which drives the two classifiers $F_{1}$ and $F_{2}$ to expand class-specific boundaries comprehensively. We select a CNN as the second encoder $E_{2}$ to cluster target features based on the boundaries identified by the ViT. Both encoders use the same two classifiers $F_{1}$ and $F_{2}$.
  • Figure 3: Illustration of co-training strategy.
  • Figure 4: (a) Quality and quantity of the pseudo labels generated by the CNN branch on DomainNet under the 3-shot setting of the $rel{\rightarrow}clp$ task using ResNet-34. (b) Comparison of backbone settings on DomainNet under the 3-shot setting, showing the mean accuracy across all domain shift tasks.
  • Figure 5: We visualize feature spaces for the $rel{\rightarrow}skt$ task on DomainNet in the 3-shot scenario using t-SNE. Panels (a) and (b) show the features obtained by the CNN and ViT branches before adaptation, respectively. Panels (c) and (d) show how the feature distribution of the CNN branch changes depending on whether the FTC strategy is used in our ECB method.
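
The co-training strategy of Figure 3 and the pseudo labels of Figure 4(a) can be pictured with a minimal sketch of confidence-thresholded pseudo-label exchange. It assumes each branch keeps only the predictions whose top probability reaches a fixed threshold and hands them to the other branch as targets. The threshold value `0.95` and the names `select_pseudo_labels` and `exchange_pseudo_labels` are hypothetical, not taken from the paper.

```python
def select_pseudo_labels(probs, threshold=0.95):
    """Return (sample_index, class_index) pairs for confident predictions.

    `probs` is a list of per-sample probability lists. The threshold is a
    fixed, illustrative value (an assumption, not the paper's setting).
    """
    selected = []
    for idx, p in enumerate(probs):
        confidence = max(p)
        if confidence >= threshold:
            selected.append((idx, p.index(confidence)))
    return selected


def exchange_pseudo_labels(vit_probs, cnn_probs, threshold=0.95):
    """Each branch is supervised by the other branch's confident predictions."""
    return {
        "targets_for_cnn": select_pseudo_labels(vit_probs, threshold),
        "targets_for_vit": select_pseudo_labels(cnn_probs, threshold),
    }


# Toy predictions for three unlabeled target samples and three classes.
vit_probs = [[0.97, 0.02, 0.01], [0.50, 0.30, 0.20], [0.01, 0.01, 0.98]]
cnn_probs = [[0.60, 0.30, 0.10], [0.02, 0.96, 0.02], [0.10, 0.10, 0.80]]

exchanged = exchange_pseudo_labels(vit_probs, cnn_probs)
# exchanged["targets_for_cnn"] == [(0, 0), (2, 2)]
# exchanged["targets_for_vit"] == [(1, 1)]
```

In this toy case the ViT branch supplies two confident labels to the CNN branch and receives one in return. A fixed threshold trades quantity for quality, the two properties plotted in Figure 4(a).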