Quantile Activation: Correcting a Failure Mode of ML Models

Aditya Challa, Sravan Danda, Laurent Najman, Snehanshu Saha

TL;DR

This work identifies a fundamental failure mode: because neuron outputs are fixed, standard ML models fail to adapt to context distributions or distribution shifts. It introduces Quantile Activation (QAct), a neuron-level activation that maps pre-activations to their percentile within the batch’s context distribution, $\text{QAct}(z) = F_{z}(z)$, with gradient $\partial \text{QAct}(z)/\partial z = f_{z}(z)$, and grounds neurons to avoid degenerate learning using $z^{\ddagger}$ and KDE-based density estimates. The method extends to dense and convolutional layers, supports multiple loss functions (notably Watershed), and uses a Quantile Classifier for calibrated probabilities, all at a manageable computational cost ($\mathcal{O}(n\log n) + \mathcal{O}(S n_{\tau})$ per neuron). Empirically, QAct tolerates distortions robustly across CIFAR10C, CIFAR100C, and TinyImagenetC: it outperforms ReLU-based baselines and a small DINOv2 in high-distortion regimes and keeps calibration stable, suggesting strong potential for domain generalization and reliable downstream decision making.

Abstract

Standard ML models fail to infer the context distribution and suitably adapt. For instance, learning fails when the underlying distribution is actually a mixture of distributions with contradictory labels. Learning also fails if there is a shift between train and test distributions. Standard neural network architectures like MLPs or CNNs are not equipped to handle this. In this article, we propose a simple activation function, quantile activation (QAct), that addresses this problem without significantly increasing computational costs. The core idea is to "adapt" the outputs of each neuron to its context distribution. QAct outputs the relative quantile position of neuron activations within their context distribution, diverging from the direct numerical outputs common in traditional networks. A specific case of the above failure mode is when there is an inherent distribution shift, i.e., the test distribution differs slightly from the train distribution. We validate the proposed activation function under covariate shifts, using datasets designed to test robustness against distortions. Our results demonstrate significantly better generalization across distortions compared to conventional classifiers and other adaptive methods, across various architectures. Although this paper presents a proof of concept, we find that this approach unexpectedly outperforms DINOv2 (small), despite DINOv2 being trained with a much larger network and dataset.
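The definition $\text{QAct}(z) = F_{z}(z)$ with gradient $f_{z}(z)$ can be illustrated with a small NumPy sketch for a single neuron. This is not the authors' implementation: the function names `qact_forward` and `qact_grad`, the use of the empirical CDF over the whole batch, the Gaussian kernel, and the normal-reference bandwidth are all assumptions made for illustration. In particular, the dense KDE below costs $\mathcal{O}(n^2)$ and omits the grounding step, so it does not reproduce the cost or the training behaviour described in the paper.

```python
import numpy as np

def qact_forward(z):
    """Empirical-CDF stand-in for QAct on one neuron.

    z: 1-D array of pre-activations over the batch (the context).
    Returns, for each entry, the fraction of batch values <= that entry.
    """
    z = np.asarray(z, dtype=float)
    sorted_z = np.sort(z)                                # O(n log n)
    ranks = np.searchsorted(sorted_z, z, side="right")   # count of values <= z_i
    return ranks / z.size

def qact_grad(z, bandwidth=None):
    """Gaussian-KDE estimate of the density f_z at each z_i.

    Used here as a surrogate for dQAct/dz. The bandwidth rule is an
    assumption (normal-reference rule of thumb), not taken from the paper.
    """
    z = np.asarray(z, dtype=float)
    n = z.size
    if bandwidth is None:
        bandwidth = 1.06 * z.std() * n ** (-1 / 5) + 1e-12
    diffs = (z[:, None] - z[None, :]) / bandwidth        # pairwise differences
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

batch = np.array([0.3, -1.2, 2.5, 0.0])
out = qact_forward(batch)                  # quantiles 0.75, 0.25, 1.0, 0.5
out_shifted = qact_forward(2.0 * batch + 1.0)
grad = qact_grad(batch)
```

Because the output depends only on the rank of each pre-activation within the batch, rescaling and shifting the whole batch (`2.0 * batch + 1.0`) leaves `out_shifted` equal to `out`, which is the adaptation to context that the abstract describes.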

Paper Structure (59 sections, 1 theorem, 9 equations, 15 figures, 3 tables, 2 algorithms)

Key Result

Proposition 1

Let ${\bm{A}}$ denote the transformation as described above. If ${\bm{A}}^t w = \alpha w$ with $\alpha > 0$, then the class assignment of the linear model with quantile activation does not change under the transformation ${\bm{x}} \mapsto {\bm{A}}{\bm{x}} + z$, where $z$ can be arbitrary.
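One way to see why the condition matters: under it, $w^t({\bm{A}}{\bm{x}} + z) = \alpha\, w^t{\bm{x}} + w^t z$, an increasing affine map of the original pre-activation, so the rank of each sample within the batch is preserved. The sketch below checks this numerically. It is an illustration under stated assumptions, not the paper's proof: the helper `qact`, the construction ${\bm{A}} = \alpha I + u v^t$ with $u \perp w$, and the random data are all hypothetical choices, and the quantile is the empirical one over the batch.

```python
import numpy as np

def qact(pre):
    """Empirical quantile of each pre-activation within its batch."""
    order = np.sort(pre)
    return np.searchsorted(order, pre, side="right") / pre.size

rng = np.random.default_rng(0)
d, n, alpha = 5, 200, 0.7

w = rng.normal(size=d)            # weights of the linear model
X = rng.normal(size=(n, d))       # rows are samples x

# Assumed construction: A = alpha*I + u v^T with u orthogonal to w,
# which gives A^T w = alpha*w + v (u . w) = alpha*w.
u = rng.normal(size=d)
u -= (u @ w) / (w @ w) * w        # remove the component of u along w
v = rng.normal(size=d)
A = alpha * np.eye(d) + np.outer(u, v)
z = rng.normal(size=d)            # arbitrary shift vector

X_shifted = X @ A.T + z           # each row becomes A x + z

q_original = qact(X @ w)
q_shifted = qact(X_shifted @ w)
```

Here `q_shifted` matches `q_original`, so any decision rule that reads only the quantile activation assigns the same class before and after the transformation.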

Figures (15)

  • Figure 1: A simple toy example to illustrate where ML systems fail. (a) The distribution is a mixture of Gaussian distributions whose centers $(\mu_1, \mu_2)$ are separated by $30^\circ$. The centers themselves can lie anywhere on the unit circle (please refer to the text for the exact description). The dotted lines indicate the optimal linear classifier for a given $\mu_1, \mu_2$, across different values of $\mu_1, \mu_2$. (b) Histogram of accuracy over $1000$ different combinations of $\mu_1, \mu_2$ for both ReLU activation and after incorporating QAct. Clearly, ReLU activation alone cannot perform better than a random guess. With QAct, on the other hand, the model can easily infer the latent $\mu_1, \mu_2$.
  • Figure 2: Comparing t-SNE plots of QAct and ReLU activation on CIFAR10C with Gaussian distortions. Observe that QAct maintains the class structure extremely well across distortions, while the usual ReLU activation loses the class structure as severity increases.
  • Figure 3: Intuition behind quantile activation. (a) shows a simple toy distribution of points (blue), its distortion (orange), and a simple line (red) on which the samples are projected to obtain activations. (b) shows the distribution of the pre-activations. (c) shows the distribution of the activations with QAct under the original distribution (blue). (d) shows the distribution of the activations with QAct under the distorted distribution (orange). Observe that the distributions match perfectly under small distortions. Note that even if the distributions match perfectly, the quantile activation is actually a deterministic function.
  • Figure 4: Comparing QAct with ReLU activation and DINOv2 (small) on CIFAR10C. At low severity of distortions, QAct has accuracy similar to existing pipelines; at higher severity, its drop in accuracy is substantially smaller than that of existing approaches. With respect to calibration, we observe that the calibration error remains constant (up to standard deviations) across distortions.
  • Figure 5: (a) Dependence on loss functions. Here we compare Watershed with other popular loss functions (Triplet and Cross-Entropy) when used with QAct. We see that Watershed performs slightly better with respect to MAP. (b) Comparing QAct with other popular activations (ReLU/pReLU/SELU) with respect to drop in accuracy. (c) Comparing QAct with other popular activations (ReLU/pReLU/SELU) with respect to Calibration Error (Marginal). From both (b) and (c) we can conclude that QAct is notably more robust across distortions than several of the existing activations. All the plots use ResNet18 with the CIFAR10C dataset.
  • ...and 10 more figures

Theorems & Definitions (1)

  • Proposition 1