A Practical Approach to Novel Class Discovery in Tabular Data

Colin Troisemaine, Alexandre Reiffers-Masson, Stéphane Gosselin, Vincent Lemaire, Sandrine Vaton

TL;DR

This work addresses Novel Class Discovery (NCD) in tabular data under realistic constraints. It proposes a hyperparameter-optimization pipeline based on $k$-fold cross-validation in which some of the known classes are hidden in each fold, and introduces Projection-Based NCD (PBN), a compact deep model that learns a latent space conducive to clustering novel classes. It also adapts two unsupervised clustering methods, $k$-means and Spectral Clustering, into NCD k-means and NCD Spectral Clustering (NCD SC), which leverage information from the known classes, and shows that the latent space of PBN enables reliable estimation of the number of novel classes $C^u$ via cluster validity indices. In experiments on seven tabular datasets, PBN achieves state-of-the-art performance in realistic settings where novel labels are unavailable for tuning, while NCD SC remains competitive and NCD k-means provides a simple, fast alternative. The result is a practical, open-world approach to NCD on tabular data, with hyperparameter tuning that requires no novel-class labels, latent-space estimation of $C^u$, and accessible code for replication.

Abstract

The problem of Novel Class Discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the $k$-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and performs impressively well under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms ($k$-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.
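The tuning idea in the abstract, hiding some of the known classes in each fold, can be illustrated with a minimal sketch. The details here are assumptions and are not taken from the paper: the helper names `hidden_class_folds` and `split_fold` are hypothetical, and the known classes are assumed to be split into disjoint groups, one hidden group per fold. Within a fold, the hidden classes play the role of novel classes, and their held-out labels are used only to score a hyperparameter combination.

```python
import random


def hidden_class_folds(known_labels, n_folds, seed=0):
    """Return a list of (visible_classes, hidden_classes) pairs, one per fold.

    Assumption: the known classes are shuffled and dealt into n_folds
    disjoint groups, and fold i hides group i.
    """
    classes = sorted(set(known_labels))
    if not 2 <= n_folds <= len(classes):
        raise ValueError("n_folds must be between 2 and the number of known classes")
    rng = random.Random(seed)
    rng.shuffle(classes)
    groups = [classes[i::n_folds] for i in range(n_folds)]
    folds = []
    for hidden in groups:
        hidden_set = set(hidden)
        visible = [c for c in classes if c not in hidden_set]
        folds.append((sorted(visible), sorted(hidden)))
    return folds


def split_fold(X, y, hidden_classes):
    """Split one fold into labeled data, unlabeled data, and held-out labels.

    The held-out labels are meant for scoring only; an NCD method under
    evaluation would see just the labeled pairs and the unlabeled points.
    """
    hidden = set(hidden_classes)
    labeled = [(x, label) for x, label in zip(X, y) if label not in hidden]
    unlabeled = [x for x, label in zip(X, y) if label in hidden]
    held_out = [label for label in y if label in hidden]
    return labeled, unlabeled, held_out
```

Each candidate hyperparameter combination would then be run on every fold and ranked by its average clustering score on the hidden classes, so that no label from the truly novel classes is needed.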

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 11 tables, 3 algorithms.

Figures (7)

  • Figure 1: t-SNE plots of the Pendigits dataset depicting the centroids before and after convergence. Note that the centroids of the known classes (the squares) do not move: each stays fixed at the mean point of its class.
  • Figure 2: NCD Spectral Clustering parameter optimization process.
  • Figure 3: Architecture of the PBN model.
  • Figure 4: The $k$-fold cross-validation approach for hyperparameter optimisation of NCD methods.
  • Figure 5: Comparison between the ARI on the hidden and novel classes. Each point is a different hyperparameter combination.
  • ...and 2 more figures
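
The behaviour described in the Figure 1 caption, where the known-class centroids stay at their class means while the other centroids converge, can be sketched as follows. This is one plausible reading under stated assumptions, not the paper's algorithm: the function name `ncd_kmeans_sketch` is hypothetical, the novel centroids are seeded with a simple farthest-point rule, and every unlabeled point is assigned to its nearest centroid among known and novel ones, with only the novel centroids being updated.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ncd_kmeans_sketch(X_lab, y_lab, X_unlab, n_novel, n_iter=50):
    """Return (known_centroids, novel_centroids).

    Assumes X_lab and X_unlab are non-empty lists of numeric points.
    """
    # Known-class centroids are the class means of the labeled data.
    # They are computed once and never updated.
    by_class = {}
    for x, label in zip(X_lab, y_lab):
        by_class.setdefault(label, []).append(x)
    known = [mean(by_class[c]) for c in sorted(by_class)]

    # Assumed seeding: repeatedly take the unlabeled point that is
    # farthest from all centroids chosen so far (known and novel).
    novel = []
    for _ in range(n_novel):
        anchors = known + novel
        farthest = max(X_unlab, key=lambda x: min(sq_dist(x, c) for c in anchors))
        novel.append([float(v) for v in farthest])

    n_known = len(known)
    for _ in range(n_iter):
        centroids = known + novel
        # Assign each unlabeled point to its nearest centroid.
        assign = [
            min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
            for x in X_unlab
        ]
        # Update only the novel centroids; an empty cluster keeps its centroid.
        updated = []
        for j in range(n_novel):
            members = [x for x, a in zip(X_unlab, assign) if a == n_known + j]
            updated.append(mean(members) if members else novel[j])
        if updated == novel:
            break
        novel = updated
    return known, novel
```

Keeping the known centroids frozen means the labeled data acts as a fixed reference: novel centroids are pushed toward regions of the unlabeled data that the known classes do not already explain.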