Quantile Activation: Correcting a Failure Mode of ML Models
Aditya Challa, Sravan Danda, Laurent Najman, Snehanshu Saha
TL;DR
This work identifies a fundamental failure mode: standard ML models produce fixed neuron outputs and therefore fail to adapt to context distributions or distribution shifts. It introduces Quantile Activation (QAct), a neuron-level activation that maps each pre-activation to its percentile within the batch’s context distribution, $\text{QAct}(z) = F_{z}(z)$, with gradient $\partial \text{QAct}(z)/\partial z = f_{z}(z)$. To avoid degenerate learning, it grounds neurons using $z^{\ddagger}$ and KDE-based density estimates. The method extends to dense and convolutional layers, supports multiple loss functions (notably Watershed), and uses a Quantile Classifier for calibrated probabilities, at a manageable computational cost of $\mathcal{O}(n\log n) + \mathcal{O}(S n_{\tau})$ per neuron. Empirically, QAct tolerates distortions robustly on CIFAR10C, CIFAR100C, and TinyImagenetC: it outperforms ReLU-based baselines and a small DINOv2 in high-distortion regimes while keeping calibration stable, which suggests potential for domain generalization and reliable downstream decision making.
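To make the two formulas concrete, here is a minimal NumPy sketch for a single neuron, not the authors' implementation. It assumes the context distribution is simply the current batch, estimates $F_{z}$ with the empirical CDF (ranks divided by batch size), and estimates $f_{z}$ with a Gaussian KDE whose bandwidth comes from a normal-reference rule of thumb. The function names `qact_forward` and `qact_grad` are hypothetical. The grounding step with $z^{\ddagger}$ is omitted, and the KDE uses all pairwise differences, which costs $\mathcal{O}(n^2)$ rather than the subsampled $\mathcal{O}(S n_{\tau})$ quoted above.

```python
import numpy as np

def qact_forward(z):
    """Empirical CDF value of each pre-activation within its batch."""
    z = np.asarray(z, dtype=float)
    n = z.shape[0]
    # Sorting is the O(n log n) step; ranks run from 1 to n.
    order = np.argsort(z, kind="stable")
    ranks = np.empty(n, dtype=float)
    ranks[order] = np.arange(1, n + 1)
    return ranks / n

def qact_grad(z, bandwidth=None):
    """Gaussian KDE estimate of the density f_z at each sample."""
    z = np.asarray(z, dtype=float)
    n = z.shape[0]
    if bandwidth is None:
        # Assumption: normal-reference rule of thumb for the bandwidth.
        bandwidth = 1.06 * z.std() * n ** (-1 / 5)
    # Pairwise scaled differences; O(n^2), fine for a small illustration.
    diffs = (z[:, None] - z[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

# Usage: one neuron's pre-activations over a batch of three samples.
batch = np.array([3.0, 1.0, 2.0])
outputs = qact_forward(batch)   # ranks 3, 1, 2 divided by n = 3
local_grads = qact_grad(batch)  # positive density estimates
```

For the batch `[3.0, 1.0, 2.0]` the forward pass returns `[1, 1/3, 2/3]`: the output depends only on where a value sits relative to the rest of the batch, not on its magnitude.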
Abstract
Standard ML models fail to infer the context distribution and adapt to it. For instance, learning fails when the underlying distribution is actually a mixture of distributions with contradictory labels. Learning also fails if there is a shift between the train and test distributions. Standard neural network architectures such as MLPs or CNNs are not equipped to handle this. In this article, we propose a simple activation function, quantile activation (QAct), that addresses this problem without significantly increasing computational costs. The core idea is to "adapt" the output of each neuron to its context distribution: QAct outputs the relative quantile position of a neuron's activation within its context distribution, rather than the raw numerical value used in traditional networks. A specific case of the above failure mode is an inherent distribution shift, i.e., the test distribution differs slightly from the train distribution. We validate the proposed activation function under covariate shifts, using datasets designed to test robustness against distortions. Our results demonstrate significantly better generalization across distortions compared to conventional classifiers and other adaptive methods, across various architectures. Although this paper presents a proof of concept, we find that this approach unexpectedly outperforms DINOv2 (small), despite DINOv2 using a much larger network and training dataset.
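The following toy example illustrates why a quantile output "adapts" to its context while a raw-valued activation does not. It is a simplified sketch under stated assumptions: the distortion is modeled as a hypothetical positive scale and shift applied to one neuron's pre-activations, which is far simpler than the image corruptions in the benchmark datasets, and `batch_quantiles` is an illustrative helper rather than the paper's code.

```python
import numpy as np

def batch_quantiles(z):
    """Relative quantile position of each value within its own batch."""
    z = np.asarray(z, dtype=float)
    # argsort of argsort gives 0-based ranks; shift to 1..n and normalize.
    ranks = np.argsort(np.argsort(z)) + 1
    return ranks / z.shape[0]

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
clean = rng.normal(size=256)   # pre-activations on clean inputs
shifted = 3.0 * clean + 5.0    # hypothetical distortion: scale and shift

# Quantile positions depend only on ordering, so they are unchanged.
same_quantiles = np.allclose(batch_quantiles(clean), batch_quantiles(shifted))
# ReLU passes the raw values through, so its outputs change.
relu_changed = not np.allclose(relu(clean), relu(shifted))
```

Because any strictly increasing transformation preserves the ordering within the batch, the quantile outputs are identical before and after the distortion, whereas the ReLU outputs move with the raw values.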
