HiLo: A Learning Framework for Generalized Category Discovery Robust to Domain Shifts

Hongjun Wang, Sagar Vaze, Kai Han

TL;DR

The paper tackles Generalized Category Discovery under domain shifts, where unlabelled data may come from multiple domains and include both seen and novel categories. It introduces HiLo, a learning framework that disentangles low-level domain features and high-level semantic features by minimizing their mutual information $I(z_d; z_s)$, and augments this with PatchMix-based contrastive learning and curriculum sampling to bridge domain gaps while preserving semantic structure. Empirical results on DomainNet and the corrupted SSB-C benchmark show that HiLo substantially outperforms state-of-the-art GCD methods, verifying the effectiveness of domain–semantic disentanglement, patch-based augmentation, and progressive domain exposure. The work provides a principled approach to robust open-world category discovery with domain shifts, with practical impact for web-scale, multi-domain data and cross-domain applications. Key technical contributions include the Jensen–Shannon MI estimator for feature disentanglement, PatchMix adaptations for GCD, and a curriculum strategy that gradually introduces unseen-domain samples during training.
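
The Jensen–Shannon MI estimator mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the critic is a fixed dot product chosen for simplicity, whereas in practice the critic would be a learned network and the framework would minimize the resulting estimate between $z_d$ and $z_s$. Paired features (same sample) stand in for the joint distribution and mismatched pairs for the product of marginals.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dot(u, v):
    # Hypothetical fixed critic; a learned network would normally be used.
    return sum(a * b for a, b in zip(u, v))

def js_mi_estimate(z_d, z_s, critic=dot):
    """Jensen-Shannon-style MI estimate between paired feature lists.

    z_d[i] and z_s[i] belong to the same sample (joint pairs);
    z_d[i] and z_s[j] with i != j approximate the product of marginals.
    """
    n = len(z_d)
    # Expectation over joint pairs of -softplus(-T).
    pos = sum(-softplus(-critic(z_d[i], z_s[i])) for i in range(n)) / n
    # Expectation over mismatched pairs of softplus(T).
    neg = sum(
        softplus(critic(z_d[i], z_s[j]))
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return pos - neg

# Toy 1-D features: the first pairing is perfectly aligned, the second is not.
z_d = [[2.0], [-2.0], [2.0], [-2.0]]
z_s_dependent = [[2.0], [-2.0], [2.0], [-2.0]]
z_s_independent = [[2.0], [2.0], [-2.0], [-2.0]]
print(js_mi_estimate(z_d, z_s_dependent) > js_mi_estimate(z_d, z_s_independent))
```

The estimate is offset from the true mutual information by a constant and is never positive, so it is useful for comparing or minimizing dependence rather than for reading off an MI value directly.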

Abstract

Generalized Category Discovery (GCD) is a challenging task in which, given a partially labelled dataset, models must categorize all unlabelled instances, regardless of whether they come from labelled categories or from new ones. In this paper, we challenge a remaining assumption in this task: that all images share the same domain. Specifically, we introduce a new task and method to handle GCD when the unlabelled data also contains images from different domains to the labelled set. Our proposed "HiLo" networks extract High-level semantic and Low-level domain features, before minimizing the mutual information between the representations. Our intuition is that the clusterings based on domain information and semantic information should be independent. We further extend our method with a specialized domain augmentation tailored for the GCD task, as well as a curriculum learning approach. Finally, we construct a benchmark from corrupted fine-grained datasets as well as a large-scale evaluation on DomainNet with real-world domain shifts, reimplementing a number of GCD baselines in this setting. We demonstrate that HiLo outperforms SoTA category discovery models by a large margin on all evaluations.
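
The independence intuition can be made concrete with a small Python sketch that measures the empirical mutual information between a domain clustering and a semantic clustering. The function name and the toy labels are hypothetical and only illustrate the idea; the paper works with continuous features rather than discrete assignments.

```python
import math
from collections import Counter

def clustering_mutual_information(domain_ids, semantic_ids):
    """Empirical mutual information (in nats) between two label assignments."""
    n = len(domain_ids)
    joint = Counter(zip(domain_ids, semantic_ids))
    count_d = Counter(domain_ids)
    count_s = Counter(semantic_ids)
    mi = 0.0
    for (d, s), count in joint.items():
        p_ds = count / n
        # p(d, s) / (p(d) * p(s)) written in terms of raw counts.
        mi += p_ds * math.log(p_ds * n * n / (count_d[d] * count_s[s]))
    return mi

domains = ["real", "real", "sketch", "sketch"]

# Each class appears equally in each domain: the clusterings are independent.
independent = clustering_mutual_information(domains, ["cat", "dog", "cat", "dog"])

# Class is fully determined by domain: the clusterings are entangled.
entangled = clustering_mutual_information(domains, ["cat", "cat", "dog", "dog"])

print(independent, entangled)
```

When the two clusterings are independent the value is zero, and it grows as domain membership becomes predictive of the semantic cluster, which is the kind of entanglement the method aims to remove.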

Paper Structure (33 sections, 6 theorems, 62 equations, 10 figures, 26 tables)

Key Result

Lemma 1

Consider a symmetric hypothesis class $\mathbb{G}$ defined on the space $\mathcal{X}$, with a VC dimension $d$. Let $\Omega^a$ and $\Omega^b$ be collections of samples under domains $\mathcal{D}_1$ and $\mathcal{D}_2$. $\hat{d}_{\mathbb{G}}(\Omega^a, \Omega^b)$ is the empirical $\mathcal{A}$-distance ...

Figures (10)

  • Figure 1: We present a new task where a model must categorize unlabelled instances from both seen and unseen categories, as well as seen and novel domains. In the example above, models are given labels only for the images in green boxes. The models are tasked with categorizing all unlabelled images, including those from different domains (top two rows) and novel categories (rightmost three columns on an orange background).
  • Figure 2: Overview of the HiLo framework. Samples are drawn through our proposed curriculum sampling approach, which considers the difficulty of each sample. Labelled and unlabelled samples are paired and augmented through PatchMix, which we adapt in the embedding space for contrastive learning in GCD. The mixed-up embeddings are then processed by our network with a high-level (semantic) and low-level (domain) feature design, enabling domain-semantic disentangled feature learning via mutual information minimization.
  • Figure 3: To investigate the effect of features extracted from different layers, we fix the layer for one of the two heads while varying the other on the CUB-C dataset. Features from the first and last layers yield the best performance.
  • Figure 4: Our SSB-C dataset includes 45 distinct corruptions that are algorithmically generated from 9 types of corruptions, covering noise, blur, weather, and digital corruptions. Each type has 5 severity levels.
  • Figure 5: Illustration of PatchMix and loss functions. (a) PatchMix augments the data by mixing up image patches in the embedding space with $\beta$ sampled from a Beta distribution. (b) The similarity matrix for representation learning and (c) mixed embedding patches for classification learning are adjusted according to the actual semantic components within the mixed patches, determined by $\alpha$.
  • ...and 5 more figures
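
The PatchMix step described in the Figure 2 and Figure 5 captions can be sketched roughly as follows. This is one plausible reading under stated assumptions, not the paper's exact procedure: it assumes mixing means replacing a random subset of one sample's patch embeddings with the other sample's, it draws the mixing ratio from a symmetric Beta distribution, and the names `patchmix_embeddings`, `beta_param`, and the returned kept fraction are hypothetical.

```python
import random

def patchmix_embeddings(patches_a, patches_b, beta_param=1.0, rng=None):
    """Mix two equal-length lists of patch embeddings.

    Returns the mixed patch list and the fraction of patches that
    actually came from sample A (usable to weight targets or similarities).
    """
    rng = rng or random.Random()
    if len(patches_a) != len(patches_b):
        raise ValueError("both samples need the same number of patches")
    n = len(patches_a)
    # Mixing ratio drawn from a symmetric Beta distribution (assumed form).
    beta = rng.betavariate(beta_param, beta_param)
    n_keep = round(beta * n)
    # Randomly choose which patch positions keep sample A's embedding.
    keep = set(rng.sample(range(n), n_keep))
    mixed = [patches_a[i] if i in keep else patches_b[i] for i in range(n)]
    return mixed, n_keep / n

# Toy example: 8 patches with 2-D embeddings per sample.
sample_a = [[1.0, 0.0] for _ in range(8)]
sample_b = [[0.0, 1.0] for _ in range(8)]
mixed, kept_fraction = patchmix_embeddings(sample_a, sample_b, rng=random.Random(0))
print(len(mixed), kept_fraction)
```

Because the number of kept patches is rounded, the returned fraction can differ from the sampled ratio; using the realized fraction to weight the two samples' targets mirrors the captions' point that the losses are adjusted to the actual semantic content of the mixed patches.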

Theorems & Definitions (12)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 2 more