Sub-Clustering for Class Distance Recalculation in Long-Tailed Drug Classification
Yujia Su, Xinjie Li, Lionel Z. Wang
TL;DR
The paper tackles long-tailed drug classification by showing that tail classes are not universally harder to identify: some have distinctive molecular structures that make them easy to separate. It introduces a sub-cluster supervised contrastive learning framework combined with a dynamic inter-class distance weighting mechanism, using class and subcluster centroids to compute separability-based loss weights $\hat{\omega}_c$ and $\hat{\omega'}_c$ that adapt during training. Key contributions include showing that distance-based identifiability indicates classification difficulty better than sample count does, designing a two-tier contrastive objective, and fusing global class separability with local subcluster distributions. Empirical results on USPTO-50K, HIV, and SBAP demonstrate improved tail-class performance without sacrificing head-class accuracy, offering a scalable approach for real-world, imbalanced drug discovery tasks.
Abstract
Long-tailed data distributions are prevalent in real-world settings, making it challenging for models to learn and classify tail classes effectively. However, we find that in drug chemistry, certain tail classes are more identifiable during training because of their unique molecular structural features, a finding that contrasts sharply with the conventional view that tail classes are generally difficult to identify. Existing imbalance learning methods, such as resampling and cost-sensitive reweighting, rely too heavily on sample quantity priors, causing models to focus excessively on tail classes at the expense of head-class performance. To address this issue, we propose a method that departs from the traditional static evaluation paradigm based on sample size. Instead, we establish a dynamic inter-class separability metric using feature distances between classes. Specifically, we employ a sub-clustering contrastive learning approach to learn the embedding features of each class, and we dynamically compute the distances between class embeddings to capture how samples from different classes move relative to one another in the feature space, thereby rebalancing the weights of the classification loss function. We conducted experiments on multiple existing long-tailed drug datasets and achieved competitive results, improving the accuracy of tail classes without compromising the performance of dominant classes.
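To make the idea of distance-based reweighting concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's formulation: the abstract does not give the weighting rule, so the function name `separability_weights`, the nearest-centroid notion of separability, the softmax mapping, and the `temperature` parameter are all assumptions chosen for illustration. The sketch only shows the general pattern of recomputing per-class loss weights from distances between class centroids, so that classes lying close to another class receive more weight regardless of how many samples they have.

```python
import numpy as np


def separability_weights(embeddings, labels, temperature=1.0):
    """Illustrative per-class loss weights from centroid distances.

    Assumption (not from the paper): a class's separability is the distance
    from its centroid to the nearest other class centroid, and weights are a
    softmax over negative separability, rescaled to have mean 1.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if len(classes) < 2:
        raise ValueError("at least two classes are required")

    # One centroid per class in the embedding space.
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])

    # Pairwise Euclidean distances between centroids; ignore self-distance.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)

    # Distance to the nearest other class centroid.
    nearest = dist.min(axis=1)

    # Less separable classes (small nearest distance) get larger weights.
    logits = -nearest / temperature
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum() * len(classes)  # mean weight of 1

    return dict(zip(classes.tolist(), weights.tolist()))


# Toy usage: classes 0 and 1 sit close together, class 2 is far away,
# even though class 2 has the fewest samples.
toy_embeddings = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
toy_labels = np.array([0, 0, 1, 1, 2])
toy_weights = separability_weights(toy_embeddings, toy_labels)
```

In a training loop, such weights would be recomputed periodically from the current embeddings and passed to a weighted classification loss. In the toy case, the small but well-separated class 2 receives a lower weight than the two classes that sit close together, which mirrors the observation that sample count alone does not determine difficulty.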
