Sub-Clustering for Class Distance Recalculation in Long-Tailed Drug Classification

Yujia Su, Xinjie Li, Lionel Z. Wang

TL;DR

The paper tackles long-tailed drug classification by showing that tail classes are not universally harder to identify, owing to distinctive molecular structures. It introduces a sub-cluster supervised contrastive learning framework combined with a dynamic inter-class distance weighting mechanism, using class and subcluster centroids to compute separability-based loss weights $\hat{\omega}_c$ and $\hat{\omega'}_c$ that adapt during training. Key contributions include revealing distance-based identifiability as a better indicator of classification difficulty, designing a two-tier contrastive objective, and fusing global class separability with local subcluster distributions. Empirical results on USPTO-50K, HIV, and SBAP demonstrate improved tail-class performance without sacrificing head-class accuracy, offering a scalable approach for real-world, imbalanced drug discovery tasks.

Abstract

Long-tailed data distributions are prevalent in real-world data, making it hard for models to learn and classify tail classes effectively. However, we find that in drug chemistry certain tail classes are more identifiable during training because of their distinctive molecular structural features, which contrasts with the conventional view that tail classes are generally difficult to identify. Existing imbalance learning methods, such as resampling and cost-sensitive reweighting, rely too heavily on sample-quantity priors, causing models to focus excessively on tail classes at the expense of head-class performance. To address this issue, we propose a method that departs from the traditional static evaluation paradigm based on sample size. Instead, we establish a dynamic inter-class separability metric using feature distances between classes. Specifically, we employ a sub-clustering contrastive learning approach to learn the embedding features of each class, and we dynamically compute the distances between class embeddings to capture how samples from different classes move relative to one another in the feature space, using these distances to rebalance the weights of the classification loss function. We conducted experiments on multiple existing long-tailed drug datasets and achieved competitive results, improving tail-class accuracy without compromising performance on the dominant classes.
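
As a rough illustration of distance-based loss reweighting, the sketch below computes class centroids in the embedding space, uses each class's distance to its nearest neighboring centroid as a separability score, and up-weights poorly separated classes in a cross-entropy loss. The inverse-distance formula, the normalization to a mean weight of 1, and all function names are assumptions made for illustration; the paper defines its weights through its own equations, which are not reproduced here.

```python
import numpy as np

def class_centroids(embeddings, labels, num_classes):
    """Mean embedding of each class (assumes every class has at least one sample)."""
    return np.stack([embeddings[labels == c].mean(axis=0) for c in range(num_classes)])

def separability_weights(centroids, eps=1e-8):
    """Hypothetical weighting: classes close to another class get larger weights."""
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # pairwise centroid distances
    np.fill_diagonal(dist, np.inf)            # ignore the distance of a class to itself
    nearest = dist.min(axis=1)                # distance to the closest other class
    raw = 1.0 / (nearest + eps)               # small distance -> hard class -> large weight
    return raw * len(raw) / raw.sum()         # rescale so the weights average to 1

def weighted_cross_entropy(logits, labels, weights):
    """Cross-entropy in which each sample is scaled by the weight of its true class."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return float((weights[labels] * nll).mean())
```

Recomputing the centroids and weights every few epochs would make the weighting follow the evolving feature space, which is the dynamic aspect the abstract describes; how often to refresh them is left open in this sketch.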

Paper Structure

This paper contains 14 sections, 14 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The relationship between the number of samples of different classification labels and classification accuracy in the dataset.
  • Figure 2: Overview of our framework. First, given drug samples represented as graph structures, we obtain the embedded feature representations of the samples through a feature extraction network. Next, we perform subcluster partitioning on the samples of the head classes in the feature space, ensuring that the sample size of each subcluster is comparable to that of the tail classes. These subcluster assignments are fed back into the optimization of the feature extraction network through a subcluster loss, guiding the contrastive learning process. For both classes and subclusters, we calculate their inter-class distances to assess classification difficulty. Finally, the learned embedded features are input into a classifier (such as a multi-layer perceptron), and the classification loss is dynamically re-weighted using the computed classification difficulty weights, thereby enabling targeted optimization for hard-to-classify samples.
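
The subcluster partitioning step in Figure 2 can be sketched as follows. This is a minimal illustration that assumes plain k-means as the clustering routine and a hypothetical `target_size` parameter standing in for the tail-class sample size; the caption does not say which clustering algorithm the authors use, and k-means only approximates the goal of subclusters comparable in size to tail classes because it does not enforce cluster sizes.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on float embeddings; returns a cluster index per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:              # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return assign

def assign_subclusters(embeddings, labels, target_size):
    """Split each class into roughly ceil(n / target_size) subclusters.

    Classes with at most `target_size` samples (the tail) stay as one subcluster.
    """
    sub_ids = np.zeros(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = max(1, int(np.ceil(len(idx) / target_size)))
        if k > 1:
            sub_ids[idx] = kmeans(embeddings[idx], k)
    return sub_ids
```

The resulting `(label, sub_id)` pairs could then serve as the positive-pair definition for a subcluster-level contrastive loss, with the class label alone defining positives at the coarser level.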