Decoupled-Value Attention for Prior-Data Fitted Networks: GP Inference for Physical Equations
Kaustubh Sharma, Simardeep Singh, Parikshit Pareek
TL;DR
Gaussian process (GP) inference is computationally expensive for large or evolving datasets. The authors propose Decoupled-Value Attention (DVA), which computes attention weights from the inputs alone and passes labels only through the values, thereby emulating the GP update without a fixed kernel. Empirically, DVA-based PFNs reduce validation loss across 1D–10D settings (by more than 50% in the 5D and 10D cases) and scale to 64D power-flow tasks, reaching a mean absolute error on the order of 10^-3 while running over 80x faster than exact GP inference. The results also indicate that the attention rule matters more than the backbone architecture. This could enable fast, uncertainty-aware surrogates for high-dimensional physical systems such as power grids.
Abstract
Prior-data fitted networks (PFNs) are a promising alternative to time-consuming Gaussian process (GP) inference for creating fast surrogates of physical systems. PFNs reduce the computational burden of GP training by replacing Bayesian inference with a single forward pass of a learned prediction model. However, with standard Transformer attention, PFNs show limited effectiveness on high-dimensional regression tasks. We introduce Decoupled-Value Attention (DVA), motivated by the GP property that the function space is fully characterized by the kernel over inputs and that the predictive mean is a weighted sum of training targets. DVA computes similarities from inputs only and propagates labels solely through values. Thus, the proposed DVA mirrors the GP update while remaining kernel-free. We demonstrate that PFNs are invariant to the backbone architecture and that the crucial factor for scaling PFNs is the attention rule rather than the architecture itself. Specifically, our results demonstrate that (a) localized attention consistently reduces out-of-sample validation loss in PFNs across different dimensional settings, with validation loss reduced by more than 50% in the five- and ten-dimensional cases, and (b) the role of attention is more decisive than the choice of backbone architecture, showing that CNN-, RNN- and LSTM-based PFNs can perform on par with their Transformer-based counterparts. The proposed PFNs provide 64-dimensional power flow equation approximations with a mean absolute error on the order of 10^-3, while being over 80x faster than exact GP inference.
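The analogy between DVA and the GP update can be made concrete with a small sketch. For a GP with kernel k and noise variance sigma^2, the predictive mean at a test input x* is k(x*, X) [K(X, X) + sigma^2 I]^-1 y: the weights depend only on inputs, and the labels y enter linearly. The NumPy code below is a minimal single-head illustration of the same split, assuming plain linear projections; the names `decoupled_value_attention`, `W_q`, `W_k`, and `W_v` are hypothetical, and the paper's actual encoders, heads, and training setup are not reproduced here. An RBF-kernel GP mean is included only for side-by-side comparison.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax along the given axis.
    scores = scores - scores.max(axis=axis, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=axis, keepdims=True)

def decoupled_value_attention(x_train, y_train, x_test, W_q, W_k, W_v):
    """Single-head sketch (assumed form, not the authors' exact layer).

    Queries and keys are built from inputs only; values from labels only.
    """
    q = x_test @ W_q                      # (m, d_k), from test inputs
    k = x_train @ W_k                     # (n, d_k), from training inputs
    v = y_train @ W_v                     # (n, d_v), from training labels
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)    # (m, n), independent of the labels
    return weights @ v, weights           # output is a weighted sum of label embeddings

def gp_posterior_mean(x_train, y_train, x_test, lengthscale=1.0, noise=1e-6):
    """Exact GP predictive mean with an RBF kernel, for comparison."""
    def rbf(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / lengthscale**2)
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)   # O(n^3) solve that a PFN forward pass avoids
    return rbf(x_test, x_train) @ alpha   # weights from inputs, linear in y

# Toy usage with random (untrained) projections, purely to show shapes.
rng = np.random.default_rng(0)
x_tr, y_tr = rng.normal(size=(20, 3)), rng.normal(size=(20, 1))
x_te = rng.normal(size=(5, 3))
W_q, W_k = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
W_v = rng.normal(size=(1, 4))
out, attn = decoupled_value_attention(x_tr, y_tr, x_te, W_q, W_k, W_v)
print(out.shape, attn.shape)
```

In both functions the weights over training points are a function of the inputs alone, so changing the labels changes the prediction but never the weighting. In standard PFN attention, by contrast, labels are typically embedded together with inputs and can therefore influence the similarity scores.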
