Enhancing selectivity using Wasserstein distance based reweighing
Pratik Worah
TL;DR
This work tackles domain shift between labeled data $\mathcal{S}$ and target data $\mathcal{T}$ by designing a scalable greedy reweighting method that tilts the training distribution toward a mixture $(1-\alpha)\mathbb{P}_{\mathcal{S}}+\alpha\mathbb{P}_{\mathcal{T}}$, with the limit distribution of neural weights characterized via the $1$-Wasserstein distance $W_1$. By reducing the exact $W_1$ computation to a greedy (and randomized) minimum-weight bipartite matching, the authors obtain near-linear-time guarantees under a small metric-entropy assumption, supported by randomized-sampling analysis. Theoretical results bound the TV distance between invariant SGD measures by $O(W_1)$ under a covariate-shift-like setting, and show that the greedy approach yields favorable approximation factors that improve with lower entropy. A drug-discovery case study on MNK1/MNK2 demonstrates practical impact: reweighting increases top-MNK2 hit selectivity and yields experimentally validated selective binders, illustrating a scalable transport-based approach to multi-target predictive modeling.
Abstract
Given two labeled datasets $\mathcal{S}$ and $\mathcal{T}$, we design a simple and efficient greedy algorithm to reweigh the loss function such that the limiting distribution of the neural network weights that results from training on $\mathcal{S}$ approaches the limiting distribution that would have resulted from training on $\mathcal{T}$. On the theoretical side, we prove that when the metric entropy of the input datasets is bounded, our greedy algorithm outputs a close-to-optimal reweighing, i.e., the two invariant distributions of network weights will be provably close in total variation distance. Moreover, the algorithm is simple and scalable, and we prove bounds on its efficiency as well. As a motivating application, we train a neural net to recognize small-molecule binders to MNK2 (a MAP kinase, responsible for cell signaling) that are non-binders to MNK1 (a highly similar protein). In our example dataset, of the 43 distinct small molecules from the Enamine catalog predicted to be most selective, 2 were experimentally verified to be selective, i.e., at 10 $\mu$M they reduced the enzyme activity of MNK2 below 50\% but not that of MNK1, a success rate of roughly 5\% (2 out of 43).
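To make the idea of a greedy matching-based reweighing concrete, the following is a minimal Python sketch, not the algorithm analyzed in this work. It assumes Euclidean distance between feature vectors, assumes $|\mathcal{T}| \le |\mathcal{S}|$, and builds all pairwise distances (quadratic cost, so it does not reflect the near-linear running time claimed here). The function names `greedy_matching` and `mixture_weights`, and the rule that turns matched pairs into loss weights, are hypothetical choices made only for illustration.

```python
import numpy as np


def greedy_matching(S, T):
    """Greedily match every target point to a distinct source point.

    Edges are visited from cheapest to most expensive, and an edge is kept
    whenever both endpoints are still unmatched. Returns the matched
    (source_index, target_index) pairs and the average matched distance.
    When len(S) == len(T), that average upper-bounds W_1 between the two
    uniform empirical measures, since a perfect matching is a coupling.
    """
    S = np.asarray(S, dtype=float)
    T = np.asarray(T, dtype=float)
    n, m = len(S), len(T)
    if m > n:
        raise ValueError("this sketch assumes len(T) <= len(S)")

    # Pairwise Euclidean distances, shape (n, m). Assumed metric.
    dist = np.linalg.norm(S[:, None, :] - T[None, :, :], axis=-1)

    # Flattened edge indices sorted by increasing distance.
    order = np.argsort(dist, axis=None, kind="stable")

    src_used = np.zeros(n, dtype=bool)
    tgt_used = np.zeros(m, dtype=bool)
    pairs, total = [], 0.0
    for flat in order:
        i, j = divmod(int(flat), m)
        if src_used[i] or tgt_used[j]:
            continue
        src_used[i] = tgt_used[j] = True
        pairs.append((i, j))
        total += dist[i, j]
        if len(pairs) == m:  # every target point is matched
            break
    return pairs, total / m


def mixture_weights(n, pairs, m, alpha):
    """Hypothetical weighting rule: (1 - alpha) * uniform on S, plus
    alpha * (mass 1/m on each source point matched to a target point)."""
    w = np.full(n, (1.0 - alpha) / n)
    for i, _ in pairs:
        w[i] += alpha / m
    return w


if __name__ == "__main__":
    S = [[0.0], [1.0], [10.0]]   # toy source features
    T = [[0.9], [9.0]]           # toy target features
    pairs, avg_cost = greedy_matching(S, T)
    weights = mixture_weights(len(S), pairs, len(T), alpha=0.5)
    print(pairs, avg_cost, weights)
```

The resulting weights sum to one and could be used as per-example multipliers in the training loss on $\mathcal{S}$; source points that sit close to target points receive more mass as $\alpha$ grows.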
