Test-Time Augmentation Meets Variational Bayes

Masanari Kimura, Howard Bondell

TL;DR

Weighted Test-Time Augmentation, in which each data augmentation is weighted by its contribution, can be formalized in a variational Bayesian framework. Optimizing the weights to maximize the marginal log-likelihood suppresses unwanted data augmentation candidates at the test phase.

Abstract

Data augmentation is known to contribute significantly to the robustness of machine learning models. In most instances, data augmentation is applied during the training phase. Test-Time Augmentation (TTA) is a technique that instead applies data augmentations during the testing phase to achieve robust predictions. More precisely, TTA averages the predictions over multiple augmented versions of an instance to produce a final prediction. Although the effectiveness of TTA has been reported empirically, the predictive performance achieved can be expected to depend on the set of data augmentation methods used during testing. In particular, the methods in this set are likely to contribute to performance to differing degrees, and some of them could have a negative impact on prediction performance. In this study, we consider a weighted version of TTA based on the contribution of each data augmentation. Some variants of TTA can be regarded as addressing the problem of determining the appropriate weighting. We demonstrate that determining the coefficients of this weighted TTA can be formalized in a variational Bayesian framework. We also show that optimizing the weights to maximize the marginal log-likelihood suppresses unwanted data augmentation candidates at the test phase.
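To make the averaging step concrete, the following is a minimal sketch of weighted TTA in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the function name `weighted_tta`, the toy classifier `toy_predict`, and the three toy augmentations are all hypothetical, and the weights are set by hand rather than learned by the variational Bayesian procedure described above.

```python
import numpy as np

def weighted_tta(predict, x, augmentations, weights=None):
    """Weighted average of predictions over augmented copies of x.

    With weights=None this reduces to standard (uniform) TTA.
    All names here are illustrative assumptions.
    """
    # One prediction per augmentation, stacked to shape (K, num_classes).
    preds = np.stack([predict(aug(x)) for aug in augmentations])
    if weights is None:
        weights = np.full(len(augmentations), 1.0 / len(augmentations))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    # Contract the weight vector (K,) against the predictions (K, C).
    return np.tensordot(weights, preds, axes=1)

def toy_predict(x):
    """Hypothetical two-class classifier: softmax over (+sum, -sum)."""
    logits = np.array([x.sum(), -x.sum()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical augmentations: identity, flip, and a harmful sign inversion.
augmentations = [lambda x: x, lambda x: x[::-1], lambda x: -x]
x = np.array([0.5, 1.0, -0.2])

uniform = weighted_tta(toy_predict, x, augmentations)
# A zero weight on the third augmentation removes it from the average.
suppressed = weighted_tta(toy_predict, x, augmentations, weights=[0.5, 0.5, 0.0])
```

In this toy setup the sign inversion flips the predicted class, so uniform averaging dilutes the prediction, while a zero weight on that augmentation recovers the unaugmented prediction. This mirrors, in a hand-tuned way, the suppression effect that the abstract attributes to the optimized weights.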

Paper Structure

This paper contains 14 sections, 39 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Test-Time Augmentation as Bayesian mixture model. Assuming that the transformed instances acquired by each data augmentation follow some probability distribution, the TTA procedure can be regarded as sampling from their mixture models.
  • Figure 2: Some example instances in the CIFAR10-N dataset (Wei et al., 2021). Each instance in this dataset has three human annotations, which are often inconsistent.
  • Figure 3: Plots of the distributions of points induced by mixup and cutmix (Gaussian distribution case). The black dots represent the input $\bm{x}$, and the figure shows the distributions induced by $\psi_M(\bm{x})$ and $\psi_C(\bm{x})$ when those $\bm{x}$ are fixed.
  • Figure 4: Plots of the distributions of points induced by mixup and cutmix (Gamma distribution case). The black dots represent the input $\bm{x}$, and the figure shows the distributions induced by $\psi_M(\bm{x})$ and $\psi_C(\bm{x})$ when those $\bm{x}$ are fixed.
  • Figure 5: Optimization of VB-TTA. The first row shows the history of the optimization of the weight coefficients. The second row shows the evolution of the weight coefficients assigned to each data augmentation during the optimization process.
  • ...and 2 more figures
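
The mixture-model view in the Figure 1 caption can be written out as a short sketch. The notation below is an assumption made for illustration and may differ from the paper's: $K$ is the number of data augmentations, $p_k(\tilde{\bm{x}} \mid \bm{x})$ is the distribution of transformed instances induced by the $k$-th augmentation, $\pi_k$ is its mixture weight, and $f$ is the trained predictor.

$$
p(\tilde{\bm{x}} \mid \bm{x}) = \sum_{k=1}^{K} \pi_k \, p_k(\tilde{\bm{x}} \mid \bm{x}),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,
$$

$$
\hat{\bm{y}} = \mathbb{E}_{p(\tilde{\bm{x}} \mid \bm{x})}\left[ f(\tilde{\bm{x}}) \right]
= \sum_{k=1}^{K} \pi_k \, \mathbb{E}_{p_k(\tilde{\bm{x}} \mid \bm{x})}\left[ f(\tilde{\bm{x}}) \right].
$$

Under this reading, setting $\pi_k = 1/K$ for every $k$ gives the standard uniformly averaged TTA, and a weight $\pi_k$ near zero removes the $k$-th augmentation from the prediction.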