Labeled Data Selection for Category Discovery

Bingchen Zhao, Nico Lang, Serge Belongie, Oisin Mac Aodha

TL;DR

The paper studies how the choice of labeled data influences generalized category discovery in unlabeled visual data. It shows that conventional data selection (favoring the most similar source data) can hurt discovery, and proposes two unsupervised selection schemes, binning and Beta-weighting, that automatically down-weight unsuitable labeled data during training. Across multiple discovery methods and fine-grained benchmarks, these data-selection strategies achieve state-of-the-art results, demonstrating that careful labeled-data selection can outperform using larger, static labeled sets. The work highlights a practical, scalable path to improving category discovery by focusing on the quality and relevance of labeled supervision rather than sheer quantity.

Abstract

Category discovery methods aim to find novel categories in unlabeled visual data. At training time, a set of labeled and unlabeled images is provided, where the labels correspond to the categories present in the images. The labeled data provides guidance during training by indicating what types of visual properties and features are relevant for performing discovery in the unlabeled data. As a result, changing the categories present in the labeled set can have a large impact on what is ultimately discovered in the unlabeled set. Despite its importance, the impact of labeled data selection has not been explored in the category discovery literature to date. We show that changing the labeled data can significantly impact discovery performance. Motivated by this, we propose two new approaches for automatically selecting the most suitable labeled data based on the similarity between the labeled and unlabeled data. Our observation is that, unlike in conventional supervised transfer learning, the best labeled data is neither too similar, nor too dissimilar, to the unlabeled categories. Our resulting approaches obtain state-of-the-art discovery performance across a range of challenging fine-grained benchmark datasets.

Paper Structure (27 sections, 16 equations, 4 figures, 12 tables)

This paper contains 27 sections, 16 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Discovery benefits from selecting a subset of the labeled data. Category discovery methods rely on labeled source data to provide context for discovering visual concepts in unlabeled target data. In the related task of supervised categorization via transfer learning [Cui_2018_CVPR], performance on the target data is maximized by selecting the most related concepts from the source data to pre-train the model (orange line). We show that discovery methods behave very differently (pink line). Learning representations from source data that is too similar to the target actually hinders discovery. Instead, it is preferable to select data that is neither too similar nor too dissimilar. The arrows $\downarrow\downarrow\downarrow$ (right) indicate the locations of the subsets (left) with the same color.
  • Figure 2: (Left) The Beta distribution is used to generate weights for the labeled data. (Right) Overview of our labeled data selection process for category discovery. The labeled data is weighted based on its distance to the unlabeled data. The weight is used to change the influence of labeled categories during training. The green and orange arrows in the left panel denote the distance from the labeled to the unlabeled data for the two instances illustrated using the same colors in the first step of the right panel.
  • Figure 3: Percentage of errors on three datasets when using 'Similar', 'Medium', 'Dissimilar', and 'OOD' labeled training sets. 'Misclassified as New' means that an example from a 'New' category is assigned to another 'New' one, and 'Misclassified as Old' means it is assigned to an 'Old' one.
  • Figure A1: Overview of our binning labeled data selection process for category discovery. We first discard labeled source data based on a threshold calculated from within the unlabeled target data. Next we discard data that is too similar to the unlabeled target data. The remaining labeled source data, along with the unlabeled target data, is then fed to a category discovery method.
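
The two selection schemes described in the captions of Figure 2 and Figure A1 can be illustrated with a rough sketch. Everything below is an assumption made for illustration and not the authors' implementation: the distance measure (labeled class mean to unlabeled mean), the Beta parameters `a` and `b`, the rule used to derive the upper threshold from the unlabeled data, and the hypothetical `low_frac` parameter are all placeholders. The sketch only shows the general idea that labeled classes at an intermediate distance receive the most influence.

```python
import math
import numpy as np


def class_distances(labeled_feats, labels, unlabeled_feats):
    """Distance from each labeled class mean to the unlabeled feature mean.

    Assumption: the summary does not specify the distance measure, so a
    simple Euclidean distance between means is used here.
    """
    target_mean = unlabeled_feats.mean(axis=0)
    classes = np.unique(labels)
    dists = np.array([
        np.linalg.norm(labeled_feats[labels == c].mean(axis=0) - target_mean)
        for c in classes
    ])
    return classes, dists


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in [0, 1] (assumes a >= 1 and b >= 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm) * x ** (a - 1) * (1 - x) ** (b - 1)


def beta_weights(dists, a=2.0, b=2.0):
    """Map class distances to weights in [0, 1] that peak at mid distance.

    Assumption: distances are min-max normalized and a = b = 2 is an
    arbitrary choice that places the peak at the midpoint.
    """
    span = dists.max() - dists.min()
    x = (dists - dists.min()) / span if span > 0 else np.full_like(dists, 0.5)
    w = beta_pdf(x, a, b)
    peak = w.max()
    return w / peak if peak > 0 else np.ones_like(w)


def binning_mask(dists, unlabeled_feats, low_frac=0.25):
    """Keep classes that are neither too far from nor too close to the target.

    Assumption: the upper threshold is the largest distance of any unlabeled
    feature to the unlabeled mean; the lower threshold is a hypothetical
    fraction (low_frac) of that value.
    """
    target_mean = unlabeled_feats.mean(axis=0)
    upper = np.linalg.norm(unlabeled_feats - target_mean, axis=1).max()
    lower = low_frac * upper
    return (dists > lower) & (dists <= upper)


# Toy features: three labeled classes at increasing distance from the target.
unlabeled = np.array([[0.0, 0.0], [2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
labeled = np.array([[0.1, 0.0], [0.1, 0.0], [1.0, 0.0], [1.0, 0.0], [5.0, 0.0], [5.0, 0.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

classes, dists = class_distances(labeled, labels, unlabeled)
weights = beta_weights(dists)            # soft per-class weights
keep = binning_mask(dists, unlabeled)    # hard per-class selection
per_sample_weight = weights[np.searchsorted(classes, labels)]
```

In this toy setup the middle class is the only one kept by the binning rule and the only one with a non-zero Beta weight. In practice, `per_sample_weight` could scale each labeled example's contribution to the supervised loss of whichever discovery method is being trained.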