Incremental Transformer Neural Processes

Philip Mortimer; Cristiana Diaconu; Tommy Rochussen; Bruno Mlodozeniec; Richard E. Turner

TL;DR

This paper introduces the Incremental TNP (incTNP), which achieves the computational benefits of causal masking without sacrificing the consistency required for streaming inference, unlocking orders-of-magnitude speedups for sequential inference.

Abstract

Neural Processes (NPs), and specifically Transformer Neural Processes (TNPs), have demonstrated remarkable performance across tasks ranging from spatiotemporal forecasting to tabular data modelling. However, many of these applications are inherently sequential, involving continuous data streams such as real-time sensor readings or database updates. In such settings, models should support cheap, incremental updates rather than recomputing internal representations from scratch for every new observation -- a capability existing TNP variants lack. Drawing inspiration from Large Language Models, we introduce the Incremental TNP (incTNP). By leveraging causal masking, Key-Value (KV) caching, and a data-efficient autoregressive training strategy, incTNP matches the predictive performance of standard TNPs while reducing the computational cost of updates from quadratic to linear time complexity. We empirically evaluate our model on a range of synthetic and real-world tasks, including tabular regression and temperature prediction. Our results show that, surprisingly, incTNP delivers performance comparable to -- or better than -- non-causal TNPs while unlocking orders-of-magnitude speedups for sequential inference. Finally, we assess the consistency of the model's updates. By adapting a metric of "implicit Bayesianness", we show that incTNP retains a prediction rule as implicitly Bayesian as standard non-causal TNPs, demonstrating that incTNP achieves the computational benefits of causal masking without sacrificing the consistency required for streaming inference.
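
To make the quadratic-versus-linear update cost concrete, here is a minimal sketch of causally masked attention with a KV cache. It is not the authors' implementation: it assumes a single layer and a single head, uses identity projections (each token serves as its own query, key and value), and the names `attend`, `causal_attention_full` and `KVCache` are hypothetical. It only illustrates that, under causal masking, appending one token and attending over the cached keys and values gives the same output as recomputing the whole masked sequence.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def causal_attention_full(tokens):
    """Recompute every position from scratch: O(n^2) score computations."""
    return [
        attend(tokens[i], tokens[: i + 1], tokens[: i + 1])
        for i in range(len(tokens))
    ]


class KVCache:
    """Hypothetical cache holding keys and values of tokens seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def update(self, token):
        """Append one token and return its output: O(n) score computations."""
        # Assumption: identity projections, so the token is its own key/value.
        self.keys.append(token)
        self.values.append(token)
        return attend(token, self.keys, self.values)


tokens = [[0.1, -0.4], [0.7, 0.2], [-0.3, 0.9], [0.5, 0.5]]
full = causal_attention_full(tokens)

cache = KVCache()
incremental = [cache.update(token) for token in tokens]

# Largest absolute difference between the two computation routes.
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(incremental, full)
    for a, b in zip(row_a, row_b)
)
print(max_gap)
```

Because earlier positions never attend to later ones under a causal mask, their outputs do not change when a new observation arrives, which is what allows the cached keys and values to be reused.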

Paper Structure (93 sections, 2 theorems, 34 equations, 19 figures, 5 tables, 1 algorithm)

Introduction
Background
Neural Processes
Conditional Neural Processes (CNPs)
Autoregressive Deployment
Transformer Neural Processes (TNPs)
Efficient Incremental Updates via Causal Masking
Related Work
Architectures for Neural Processes
Incremental Updates in Neural Processes
Streaming Prediction and Implicit Bayesianness
Foundation Models for Tabular Data
Incremental Transformer Neural Processes
Causally Masked Self-Attention
Efficient Training Strategy
...and 78 more sections

Key Result

Proposition 4.1

Let $p$ be the true data-generating distribution, which we assume to be exchangeable. Then, the following decomposition holds: and, in particular, $D_{\text{KL}}(q_{1:n} \mathrel{\|} p) \geq D_{\text{KL}}(\hat{q}_{1:n} \mathrel{\|} p)$ with equality if and only if $q_{1:n}$ is exchangeable (see the proof in the appendix).
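
As a hedged sketch of why the inequality holds, assume that $\hat{q}_{1:n}$ denotes the permutation-averaged (hence exchangeable) version of $q_{1:n}$. This definition is an assumption, since it is not given here, and inputs are suppressed for brevity. Under that assumption, one decomposition consistent with the stated inequality is:

```latex
\begin{align*}
\hat{q}_{1:n}(y_{1:n})
  &= \frac{1}{n!} \sum_{\pi \in S_n} q_{1:n}\big(y_{\pi(1)}, \dots, y_{\pi(n)}\big), \\
D_{\text{KL}}(q_{1:n} \mathrel{\|} p)
  &= \mathbb{E}_{q_{1:n}}\!\left[\log \frac{q_{1:n}}{\hat{q}_{1:n}}\right]
   + \mathbb{E}_{q_{1:n}}\!\left[\log \frac{\hat{q}_{1:n}}{p}\right] \\
  &= D_{\text{KL}}(q_{1:n} \mathrel{\|} \hat{q}_{1:n})
   + D_{\text{KL}}(\hat{q}_{1:n} \mathrel{\|} p).
\end{align*}
```

The last step uses that $\log(\hat{q}_{1:n}/p)$ is permutation-invariant when $p$ is exchangeable, so its expectation under $q_{1:n}$ equals its expectation under $\hat{q}_{1:n}$. Since $D_{\text{KL}}(q_{1:n} \mathrel{\|} \hat{q}_{1:n}) \geq 0$, with equality exactly when $q_{1:n} = \hat{q}_{1:n}$, the inequality and its equality condition follow.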

Figures (19)

Figure 1: Computational complexity analysis in the streaming setting (see the theoretical analysis in the appendix). Top: As the data stream grows (steps $1, \dots, N$), the model updates its context $C$ and predicts targets $T$. In AR mode, the model must perform $N_t$ forward passes (FP) at every step, amplifying the computational burden. Bottom Right: A comparison of the FP mechanics. Standard TNP must re-encode the entire context history at every step (cost $\mathcal{O}(N^2)$), whereas incTNP leverages KV caching to process only the incremental update (cost $\mathcal{O}(N)$). Bottom Left: The cumulative cost over a stream of length $N$. Because TNP recomputes the full attention matrix repeatedly, its total cost scales cubically ($\mathcal{O}(N^3)$). incTNP reduces this to quadratic scaling ($\mathcal{O}(N^2)$), making it the only viable option for long streams, particularly when combined with the expensive AR decoding loop.
Figure 2: Test log-likelihoods ($\uparrow$) on synthetic GP and Tabular datasets across multiple training configurations. incTNP-Seq exhibits reduced variance compared to TNP-D and incTNP, demonstrating the robustness offered by its training strategy.
Figure 3: Performance (Joint NLL gap relative to optimal) versus implicit Bayesianness (KL Gap). incTNP-Seq achieves a KL gap similar to the non-causal TNP-D, whereas the streaming GP baseline (WISKI) performs worse in both metrics. Diamond markers ($\diamond$) denote the exchangeable versions of the models.
Figure 4: Test log-likelihood ($\uparrow$) on Tabular (real-world datasets). Performance generally increases with stream length for all models. While incTNP and TNP-D remain competitive on smaller datasets, incTNP-Seq demonstrates greater robustness on Protein.
Figure 5: Test log-likelihood ($\uparrow$) on the Tabular (Synthetic) dataset. (Left) Performance improves with more data, with incTNP-Seq consistently outperforming TNP-D and incTNP—even when evaluating beyond the training limit of $1024$ points. (Right) While achieving similar accuracy, the incremental variants demonstrate significantly better scaling efficiency as context size grows.
...and 14 more figures
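
The cumulative scaling described in the Figure 1 caption can be checked with a simple operation count. The sketch below is an illustrative assumption rather than the paper's analysis: it counts only query-key score computations over the context, ignores target predictions and the AR decoding loop, and the function names are hypothetical.

```python
def cumulative_cost_recompute(stream_length):
    """Score computations when the full n x n attention matrix is rebuilt at each step n."""
    return sum(n * n for n in range(1, stream_length + 1))


def cumulative_cost_cached(stream_length):
    """Score computations when only the new token attends to the n cached tokens at step n."""
    return sum(n for n in range(1, stream_length + 1))


for length in (10, 100, 1000):
    recompute = cumulative_cost_recompute(length)
    cached = cumulative_cost_cached(length)
    # The ratio grows linearly with the stream length: (2N + 1) / 3.
    print(length, recompute, cached, recompute / cached)
```

The first sum grows as $\mathcal{O}(N^3)$ and the second as $\mathcal{O}(N^2)$, so the gap between the two strategies widens in proportion to the stream length.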

Theorems & Definitions (3)

Proposition 4.1
Proposition
proof
