Uncertainty-Aware Perceiver

EuiYul Song

TL;DR

The paper argues that the Perceiver's lack of predictive uncertainty and limited generalization evidence weaken its claimed advantages. It introduces five Uncertainty-Aware Perceiver variants—Deep-, SWA-, Snap-, Fast-, and MC-Perceiver—to produce calibrated uncertainty estimates while preserving the model's scalable attention bottleneck. Empirical results on CIFAR-10 and CIFAR-100 show that several variants, especially Deep-Perceiver, achieve higher accuracy and better calibration than the baseline Perceiver, ViT, and ResNet-50, though MC-Perceiver may underperform on some datasets. The work demonstrates that uncertainty-aware extensions can enhance multimodal architectures without sacrificing scalability and outlines directions for pretraining and Bayesian enhancements to further improve uncertainty quantification.

Abstract

The Perceiver makes few architectural assumptions about the relationships among its inputs while avoiding the quadratic scaling of memory and computation time with input size. It matches or outperforms ResNet-50 and ViT in accuracy, to some degree. However, the Perceiver does not take predictive uncertainty or calibration into account. The evidence for its generality is also narrow: three datasets, three models, one evaluation metric, and one hyperparameter setting. Worst of all, its relative performance improvement over other models is marginal. Furthermore, its reduction of architectural priors is not substantial, and having fewer priors is not equivalent to higher quality. I therefore introduce five variants of the Perceiver, the Uncertainty-Aware Perceivers, that obtain uncertainty estimates, and I measure their performance on three metrics. In experiments on CIFAR-10 and CIFAR-100, the Uncertainty-Aware Perceivers improve considerably over the Perceiver.
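
To make "calibration" concrete, the following is a minimal sketch of expected calibration error (ECE), a common calibration metric. This section does not name the three metrics the paper uses, so treating ECE as one of them is an assumption; the function name, the equal-width binning, and the default of 10 bins are likewise illustrative choices, not the author's implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over confidence bins.

    confidences: top-class probability for each prediction, in [0, 1].
    correct:     truthy where the prediction matched the label.
    n_bins:      number of equal-width bins (illustrative default, an assumption).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # bins[b] accumulates [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 falls in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += int(bool(ok))

    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count:
            accuracy = n_correct / count
            mean_conf = conf_sum / count
            ece += (count / n) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence but is right only three times out of four has a gap of about 0.15 in that bin; a well-calibrated model drives every such gap toward zero.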

Paper Structure

This paper contains 17 sections, 5 figures, 4 tables, and 2 algorithms.

Figures (5)

  • Figure 1: The Perceiver iteratively attends to the input byte array by alternating between cross-attention and latent self-attention blocks.
  • Figure 2: Left: SGD optimization using a conventional learning rate schedule. Right: Illustration of Snap-Perceiver using AdamW.
  • Figure 3: Left: Optima of three distinctly trained networks. Middle and Right: A quadratic Bezier curve used by the Fast-Perceiver, connecting the lower two optima.
  • Figure 4: Illustration of MC Dropout.
  • Figure 5: The Deep-Perceiver's predictive performance as a function of ensemble size on CIFAR-10 and CIFAR-100.
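
The quadratic Bezier curve mentioned in Figure 3 can be sketched as follows. This assumes the standard parametrization with a single control point, phi(t) = (1-t)^2 w_start + 2t(1-t) theta + t^2 w_end, applied elementwise to flattened weight vectors. The function name and argument names are hypothetical, and how the Fast-Perceiver fits the control point or samples models along the curve is not described in this section.

```python
def bezier_point(w_start, w_end, theta, t):
    """Return the point at position t on a quadratic Bezier curve in weight space.

    w_start, w_end: flattened weights of the two endpoint optima.
    theta:          flattened control point that bends the curve (hypothetical name).
    t:              position along the curve, from 0.0 (w_start) to 1.0 (w_end).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if not (len(w_start) == len(w_end) == len(theta)):
        raise ValueError("all weight vectors must have the same length")

    # Quadratic Bernstein coefficients; they sum to 1 for every t.
    a = (1.0 - t) ** 2
    b = 2.0 * t * (1.0 - t)
    c = t ** 2
    return [a * s + b * m + c * e for s, m, e in zip(w_start, theta, w_end)]


# Toy 2-D example: the curve starts at w_start, ends at w_end,
# and is pulled toward the control point in between.
midpoint = bezier_point([0.0, 0.0], [2.0, 0.0], [1.0, 2.0], 0.5)
```

Evaluating the curve at several values of t yields a family of weight vectors between the two optima, which is one way such a curve could supply ensemble members.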