Inference from Real-World Sparse Measurements

Arnaud Pannatier; Kyle Matoba; François Fleuret

TL;DR

This work proposes an attention-based model focused on robustness and practical applicability, with two key design contributions. First, it adopts a ViT-like transformer that takes both context points and read-out positions as inputs, eliminating the need for an encoder-decoder structure. Second, it uses a unified method for encoding both context and read-out positions.

Abstract

Real-world problems often involve complex and unstructured sets of measurements, which occur when sensors are sparsely placed in either space or time. Being able to model this irregular spatiotemporal data and extract meaningful forecasts is crucial. Designing deep learning architectures that can process sets of measurements whose positions vary from set to set, and that can extract readouts anywhere, is methodologically difficult. Current state-of-the-art models are graph neural networks and require domain-specific knowledge for proper setup. We propose an attention-based model focused on robustness and practical applicability, with two key design contributions. First, we adopt a ViT-like transformer that takes both context points and read-out positions as inputs, eliminating the need for an encoder-decoder structure. Second, we use a unified method for encoding both context and read-out positions. This approach is intentionally straightforward and integrates well with other systems. Compared to existing approaches, our model is simpler, requires less specialized knowledge, and does not suffer from a problematic bottleneck effect, all of which contribute to superior performance. We conduct in-depth ablation studies that characterize this bottleneck in the latent representations of alternative models, which inhibits information utilization and impedes training efficiency. We also perform experiments across various problem domains, including high-altitude wind nowcasting, two-day weather forecasting, fluid dynamics, and heat diffusion. Our attention-based model consistently outperforms state-of-the-art models in handling irregularly sampled data. Notably, our model reduces the root mean square error (RMSE) for wind nowcasting from 9.24 to 7.98 and for heat diffusion tasks from 0.126 to 0.084.
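To make the two design contributions in the abstract concrete, here is a minimal NumPy sketch of the general idea, not the authors' implementation. Context points and read-out positions pass through the same encoder, are concatenated into one token sequence, and go through a stack of self-attention layers with no separate encoder and decoder. Every name below (`encode_tokens`, `msa_sketch`, `init_params`), the zero-valued placeholder for unknown target values, the is-target flag, the single-head attention, and the random untrained weights are assumptions made for illustration.

```python
import numpy as np


def encode_tokens(positions, values, is_target, w_embed):
    """Shared encoding for context and read-out points (assumed layout).

    Each token is [position, value, is_target flag] projected by one matrix,
    so both kinds of points use the same encoder.
    """
    flag = np.full((positions.shape[0], 1), float(is_target))
    features = np.concatenate([positions, values, flag], axis=1)
    return features @ w_embed


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention with a residual connection."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return tokens + weights @ v


def msa_sketch(ctx_pos, ctx_val, tgt_pos, params):
    """Predict values at tgt_pos from the context set (ctx_pos, ctx_val)."""
    # Target values are unknown, so a zero placeholder is used (assumption).
    tgt_val = np.zeros((tgt_pos.shape[0], ctx_val.shape[1]))
    tokens = np.concatenate([
        encode_tokens(ctx_pos, ctx_val, False, params["embed"]),
        encode_tokens(tgt_pos, tgt_val, True, params["embed"]),
    ])
    for w_q, w_k, w_v in params["layers"]:
        tokens = self_attention(tokens, w_q, w_k, w_v)
    # Read out only the tokens that correspond to target positions.
    return tokens[ctx_pos.shape[0]:] @ params["readout"]


def init_params(pos_dim, val_dim, width, n_layers, out_dim, seed=0):
    """Random, untrained weights, only to make the sketch executable."""
    rng = np.random.default_rng(seed)
    in_dim = pos_dim + val_dim + 1

    def dense(n_in, n_out):
        return rng.normal(size=(n_in, n_out)) / np.sqrt(n_in)

    return {
        "embed": dense(in_dim, width),
        "layers": [tuple(dense(width, width) for _ in range(3))
                   for _ in range(n_layers)],
        "readout": dense(width, out_dim),
    }
```

Because the tokens carry their positions explicitly and no sequence index is used, the number of context points can change from set to set, and reordering the context points leaves the predictions at the read-out positions unchanged up to floating-point error.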

Paper Structure (46 sections, 15 equations, 12 figures, 13 tables)

This paper contains 46 sections, 15 equations, 12 figures, 13 tables.

Introduction
Related Works
Methodology
Context and Targets
Encoding Scheme
Multi-layer Self-Attention (MSA, Ours)
Baselines
Transformer(s) (TFS)
Graph Element Network(s) (GEN)
Conditional Neural Process(es) (CNP)
Experiments
Understanding Failure Modes
Capacity of Passing Information from Context to Targets
Improving Error Correction
Encoding scheme
...and 31 more sections

Figures (12)

Figure 1: Multi-layer Self-Attention
Figure 1: Description of the context and target sets in the wind nowcasting case. The context set and the target set are time slices separated by a delay, which corresponds to the forecasting window. The underlying space is in that case $\mathbb{X} \subseteq \mathbb{R}^3$ and the context values and target values both represent wind speed and belong to the same space $\mathbb{I} = \mathbb{O} \subseteq \mathbb{R}^2$.
Figure 2: Decay of precision in the wind nowcasting case. RMSE of the different models depending on the forecast duration (lower is better). For each time window and each model, we ran three experiments with different pseudorandom number generator seeds to measure the standard deviation. The error does not increase drastically over the first two hours because the wind is persistent, so the context values are good predictors of the targets in that regime.
Figure 3: Results of the information retrieval experiment, evaluated using MSE. A result is considered satisfactory if the MSE is below 0.01. The first three rows depict models without bottlenecks. The x-axis represents datasets organized by increasing frequency, with 'Random' as the extreme case where the context value is independent of its position. Models with bottlenecks are sufficient when the learned function varies minimally in space. Models in italics denote hybrid architectures: GNG represents 'GEN No Graph', maintaining a latent measure per context, and PER indicates a transformer with a Perceiver layer (Jaegle et al., 2021), introducing a bottleneck.
Figure 4: Gradients on the last layer of the encoder corresponding to an artificial error of $\epsilon=10.0$ added to the second output. MSA maintains independent latent representations, and gradients are non-zero exclusively for the latent associated with the error. We compare it to different GEN models, each initialized with a graph corresponding to a regular grid of size $i \times i$ with $i \in \{1,\dots,8\}$. Due to the bottleneck effect, the gradients corresponding to one error are propagated across different latent vectors for GEN. Even when there are enough latents (GEN 8 $\times$ 8), GEN still disperses attribution because its distance-based conditioning does not allow for a one-to-one mapping between targets and latents.
...and 7 more figures
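
The gradient behaviour described in the Figure 4 caption can be illustrated with a toy linear read-out. This is a simplified sketch under stated assumptions, not the paper's experiment: the read-out is assumed linear, the loss is assumed to be a squared error, and the bottleneck is modelled as a softmax over negative squared distances between targets and a fixed set of latent positions. The function names `readout_gradients` and `distance_weights`, and the `temperature` parameter, are hypothetical.

```python
import numpy as np


def readout_gradients(error, w_out, mixing=None):
    """Gradient of 0.5 * ||prediction - target||^2 w.r.t. last-layer latents.

    error:  (n_targets, out_dim) residual, prediction minus target
    w_out:  (width, out_dim) linear read-out weights
    mixing: optional (n_targets, n_latents) weights mapping shared latents to
            targets; None means each target has its own latent.
    """
    grad_per_target = error @ w_out.T        # (n_targets, width)
    if mixing is None:
        return grad_per_target               # one row per target latent
    return mixing.T @ grad_per_target        # (n_latents, width)


def distance_weights(target_pos, latent_pos, temperature=1.0):
    """Softmax over negative squared distances (assumed conditioning)."""
    diff = target_pos[:, None, :] - latent_pos[None, :, :]
    logits = -(diff ** 2).sum(axis=-1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_targets, n_latents, width, out_dim = 6, 4, 8, 2
w_out = rng.normal(size=(width, out_dim))

# Artificial error of 10.0 on the second output only.
error = np.zeros((n_targets, out_dim))
error[1, 0] = 10.0

independent = readout_gradients(error, w_out)
mixing = distance_weights(rng.uniform(size=(n_targets, 2)),
                          rng.uniform(size=(n_latents, 2)))
shared = readout_gradients(error, w_out, mixing)

# Rows (latents) that receive a non-zero gradient in each setting.
rows_independent = np.flatnonzero(np.abs(independent).sum(axis=1) > 0)
rows_shared = np.flatnonzero(np.abs(shared).sum(axis=1) > 0)
```

With one latent per target, only the latent of the perturbed output receives a gradient. With distance-based mixing, every latent with a non-zero weight for that target receives part of it, which is the dispersion the caption attributes to the bottleneck.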
