Learning to Transform for Generalizable Instance-wise Invariance

Utkarsh Singhal; Carlos Esteves; Ameesh Makadia; Stella X. Yu

TL;DR

This work treats invariance as a prediction problem: it predicts a distribution over transformations for each input and averages the model's predictions over samples from that distribution, yielding a flexible, generalizable, and adaptive form of invariance.

Abstract

Computer vision research has long aimed to build systems that are robust to spatial transformations found in natural data. Traditionally, this is done using data augmentation or hard-coding invariances into the architecture. However, too much or too little invariance can hurt, and the correct amount is unknown a priori and dependent on the instance. Ideally, the appropriate invariance would be learned from data and inferred at test-time. We treat invariance as a prediction problem. Given any image, we use a normalizing flow to predict a distribution over transformations and average the predictions over them. Since this distribution only depends on the instance, we can align instances before classifying them and generalize invariance across classes. The same distribution can also be used to adapt to out-of-distribution poses. This normalizing flow is trained end-to-end and can learn a much larger range of transformations than Augerino and InstaAug. When used as data augmentation, our method shows accuracy and robustness gains on CIFAR-10, CIFAR10-LT, and TinyImageNet.
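The predict-then-average idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the image-conditioned Gaussian over rotation angles in `sample_angles` is a hypothetical stand-in for the normalizing flow, the linear-softmax `classify` is a toy classifier with random weights, and all function names and parameters are assumptions made for illustration only.

```python
import numpy as np

def rotate(image, angle):
    """Rotate a 2-D image about its center using nearest-neighbor sampling."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    c, s = np.cos(angle), np.sin(angle)
    # Inverse-map every output pixel to the source location it is read from.
    src_x = c * (xs - cx) + s * (ys - cy) + cx
    src_y = -s * (xs - cx) + c * (ys - cy) + cy
    xi = np.rint(src_x).astype(int)
    yi = np.rint(src_y).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(image)  # pixels mapped from outside the image stay 0
    out[valid] = image[yi[valid], xi[valid]]
    return out

def sample_angles(image, rng, n_samples):
    """Hypothetical stand-in for the flow: an image-conditioned Gaussian
    over rotation angles (the paper uses a normalizing flow instead)."""
    std = 0.1 + 0.5 * float(image.std())  # toy dependence on the instance
    return rng.normal(0.0, std, size=n_samples)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(image, weights):
    """Toy linear-softmax classifier; `weights` is an assumed parameter."""
    return softmax(weights @ image.ravel())

def invariant_predict(image, weights, rng, n_samples=8):
    """Average class probabilities over sampled transformations."""
    angles = sample_angles(image, rng, n_samples)
    probs = [classify(rotate(image, a), weights) for a in angles]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(0)
image = rng.random((16, 16))
weights = rng.normal(size=(10, 16 * 16))
probs = invariant_predict(image, weights, rng)
```

Because each per-sample prediction is a probability vector, their average is also a valid probability vector. In the sketch the width of the sampled angle range depends on the input image, which is the instance-wise aspect the abstract emphasizes.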

Paper Structure (21 sections, 25 equations, 12 figures, 3 tables)

Introduction
Related Work
Methods
Experiments
Supplementary Material
CIFAR-10LT class-wise accuracy
Mean-shift alignment failure due to multimodality
Experimental Details
Normalizing flow model
Base Distribution
Stabilizing training with PID
Pose-embedding CNN
Figure 2: Mario/Iggy experiments
MNIST (Classes $0,1,5,6,9$)
Multi-modal experiments
...and 6 more sections

Figures (12)

Figure 1: Our goal is to build flexible, adaptive, and generalizable invariances. Flexible: The ideal invariance is flexible and instance-dependent. Different objects in different poses require different degrees of invariance. Too much hurts accuracy, and too little hurts robustness. Adaptive: The model should adapt to unexpected (out-of-distribution) poses. The figure above shows mental rotation, a process by which humans align unfamiliar objects in unexpected poses to classify them. Generalizable: Knowledge of invariances should generalize from previous experience, e.g., learning bilateral symmetry for horses and transferring it to zebras.
Figure 2: Our image classification pipeline. The normalizing flow model predicts a distribution over image transformations. Samples from this distribution are passed to a differentiable augmenter, which transforms the input image into a set of augmented images. The images are passed to a classifier, and predictions are averaged. Crucially, the transform distribution $g_\phi$ can generalize across classes and datasets.
Figure 3: Our method delivers strong gains for imbalanced classification. On CIFAR10-LT with 5000 to 500 instances per class from head to tail (black curve), our class-agnostic instance-wise transform distribution helps boost the classification accuracy by large margins (red bars) over the standard softmax baseline (blue bars) especially for the tail classes.
Figure 4: Our normalizing flow model can represent input-dependent, multi-modal, and joint distributions over augmentation parameters. (top) We illustrate three samples, each with a different set of correct augmentations. Augerino learns a range shared between all samples, so the learned range is too restrictive. InstaAug learns an instance-wise range but cannot handle a non-axis-aligned augmentation set (middle). In contrast, our model can adapt to the loss landscape and produce the largest possible set. (middle) Augerino fails to learn augmentations in challenging settings. Learned rotation range for a version of Mario-Iggy with $\pm 90^{\circ}$ rotation range. The class boundaries touch each other, so some instances lie close to the boundary, and thus global augmentation schemes like Augerino and LILA are forced to learn a range of $0$. Our method learns the correct range. (bottom) InstaAug fails to capture the distribution for a multi-modal version of the Mario-Iggy dataset.
Figure 5: Our graphical model, inspired by Miller et al. Shaded nodes represent variables observed in data ($C, I$). In contrast to Miller et al., we only model the inference process and assume that $T$ is instance-wise, not class-wise. Our flow model $g_\phi$ predicts an image-conditional transform, and the classifier $f_\theta$ classifies the resulting image $L$.
...and 7 more figures
