TULiP: Test-time Uncertainty Estimation via Linearization and Weight Perturbation
Yuhui Zhang, Dongshen Wu, Yuichiro Wada, Takafumi Kanamori
TL;DR
TULiP is a theoretically grounded, post-hoc uncertainty estimator for OOD detection. It analyzes how a hypothetical perturbation, applied to the network before convergence, propagates through the linearized training dynamics described by the Neural Tangent Kernel. Bounding the resulting fluctuations yields a surrogate posterior ensemble whose predictions are combined into OOD scores, with the largest gains in near-distribution detection and no need for access to training data. In practice, the method estimates Jacobians, calibrates kernel-related quantities, and builds a Surrogate Posterior Envelope (SPE). On OpenOOD benchmarks it achieves state-of-the-art or competitive performance in both near- and far-OOD settings. The work links training dynamics to inference-time uncertainty in a principled way and offers a scalable, plug-and-play tool that improves existing post-hoc detectors and could be extended to broader learning paradigms.
Abstract
Reliable uncertainty estimation is the foundation of many modern out-of-distribution (OOD) detectors, which are critical for the safe deployment of deep learning models in the open world. In this work, we propose TULiP, a theoretically driven post-hoc uncertainty estimator for OOD detection. Our approach considers a hypothetical perturbation applied to the network before convergence. Based on linearized training dynamics, we bound the effect of such a perturbation, which yields an uncertainty score that can be computed by perturbing the model parameters. In practice, the uncertainty is computed from a set of sampled predictions. We visualize our bound on synthetic regression and classification datasets. Furthermore, we demonstrate the effectiveness of TULiP on large-scale OOD detection benchmarks for image classification. Our method exhibits state-of-the-art performance, particularly for near-distribution samples.
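The idea of scoring uncertainty from predictions sampled under perturbed parameters can be illustrated with a minimal sketch. This is not the TULiP estimator itself: the linear softmax classifier, the isotropic Gaussian noise, the hand-picked scale `sigma`, and the mutual-information score are all assumptions made for illustration, whereas TULiP derives its perturbation from a bound on the linearized training dynamics. The function name `perturbation_uncertainty` is hypothetical.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def perturbation_uncertainty(W, b, x, n_samples=32, sigma=0.1, seed=0):
    """Score inputs by how much predictions vary under weight perturbation.

    W: (n_features, n_classes) weights of an assumed linear classifier.
    b: (n_classes,) bias.
    x: (n_inputs, n_features) inputs to score.
    sigma: assumed noise scale (hand-picked here, not derived from any bound).
    Returns (mean_probs, score), where score is the entropy of the mean
    prediction minus the mean entropy of the sampled predictions.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_samples):
        # Draw one perturbed copy of the parameters and predict with it.
        W_s = W + sigma * rng.standard_normal(W.shape)
        b_s = b + sigma * rng.standard_normal(b.shape)
        probs.append(softmax(x @ W_s + b_s))
    probs = np.stack(probs)  # (n_samples, n_inputs, n_classes)

    mean_probs = probs.mean(axis=0)
    eps = 1e-12  # avoids log(0)
    entropy_of_mean = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    # Higher score means the sampled predictions disagree more.
    return mean_probs, entropy_of_mean - mean_entropy
```

Calling `perturbation_uncertainty(W, b, x)` returns the averaged class probabilities and one score per input; thresholding that score would give a simple OOD detector. With `sigma=0` all samples coincide and the score collapses to zero, which makes the role of the perturbation scale explicit.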
