Pooling Image Datasets With Multiple Covariate Shift and Imbalance

Sotirios Panagiotis Chytas, Vishnu Suresh Lokhande, Peiran Li, Vikas Singh

TL;DR

This paper shows how viewing the pooling of image datasets with multiple covariate shifts and imbalances from the perspective of Category theory provides a simple and effective solution that avoids the elaborate multi-stage training pipelines that would otherwise be needed.

Abstract

Small sample sizes are common in many disciplines, which necessitates pooling roughly similar datasets across multiple institutions to study weak but relevant associations between images and disease outcomes. Such data often manifest shift/imbalance in covariates (i.e., secondary non-imaging data). Controlling for such nuisance variables is common within standard statistical analysis, but the ideas do not directly apply to overparameterized models. Consequently, recent work has shown how strategies from invariant representation learning provide a meaningful starting point, but the current repertoire of methods is limited to accounting for shifts/imbalances in just a couple of covariates at a time. In this paper, we show how viewing this problem from the perspective of Category theory provides a simple and effective solution that completely avoids elaborate multi-stage training pipelines that would otherwise be needed. We show the effectiveness of this approach via extensive experiments on real datasets. Further, we discuss how this style of formulation offers a unified perspective on at least five distinct problem settings, from self-supervised learning to matching problems in 3D reconstruction.

Paper Structure (24 sections, 13 equations, 13 figures, 11 tables, 1 algorithm)

Figures (13)

  • Figure 1: Problem overview: When pooling image datasets from different sources, there may be differences in the distribution of covariates. Covariates are secondary data for each individual that influence the images systematically. The top row shows MR images (say, from different scanners). The second row shows how the covariate distributions (genetic risk, age, or gender) vary across scanners (but share a support). Learning representations from a pooled dataset such that covariate variations are accounted for is challenging.
  • Figure 2: A Category theoretic view of CycleGAN (left) and SimCLR (right)
  • Figure 3: MNIST Example: Modelling the relationships between the digits as linear mappings in the target Category.
  • Figure 4: A diagrammatic representation of equivariance with respect to a single covariate whose change corresponds to "$f$" in the original data space. Our goal is to preserve this structure in the latent space (Category $\mathcal{T}$) in which we model $F(f)$ as a linear transformation $W\in\mathbb{R}^{n\times n}$. In many practical cases the downstream goal is to classify the latent representations, which we formulate with the Functor $C:\mathcal{T}\rightarrow Free(\mathbb{N})$ (or $Free(\mathbb{R})$ for a regression task, etc.)
  • Figure 5: (Left) Minimum Distance $(\mathcal{D})$ and Cosine Similarity $(\mathcal{CS})$ for age equivariance, compared with Naive and GE. (Right) Minimum Distance $(\mathcal{D})$ and Cosine Similarity $(\mathcal{CS})$ for 5-covariate equivariance. Results are consistently good.
  • ...and 8 more figures
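
The equivariance structure described in the Figure 4 caption (a covariate change $f$ in data space mirrored by a linear map $W$ in latent space) can be illustrated with a small numerical check. This is only a toy sketch, not the paper's training procedure: the linear encoder `A`, the data-space map `P`, and the helper `equivariance_residual` are all hypothetical names introduced here, and the residual shown is just one plausible way to measure how far an encoder is from satisfying $E(f(x)) = W\,E(x)$.

```python
import numpy as np


def equivariance_residual(encode, f, W, x):
    """Mean squared gap between encode(f(x)) and W applied to encode(x).

    A value of zero means the encoder is exactly equivariant to f under W.
    """
    z = encode(x)             # latent codes, shape (batch, n)
    z_shifted = encode(f(x))  # latent codes after the covariate change
    return float(np.mean((z_shifted - z @ W.T) ** 2))


rng = np.random.default_rng(0)
d, n = 6, 3                   # data and latent dimensions (toy values)

A = rng.normal(size=(n, d))   # hypothetical linear encoder: z = A x
W = rng.normal(size=(n, n))   # linear map standing in for F(f) in latent space

# Build a data-space map P that satisfies A P = W A, so the toy encoder is
# equivariant by construction (valid because A has full row rank).
P = np.linalg.pinv(A) @ W @ A


def encode(x):
    return x @ A.T


def f(x):
    return x @ P.T


x = rng.normal(size=(8, d))

residual_matched = equivariance_residual(encode, f, W, x)
residual_mismatched = equivariance_residual(encode, f, np.eye(n), x)

print(residual_matched)     # numerically zero
print(residual_mismatched)  # clearly positive: identity is the wrong latent map
```

In a learned setting, a residual of this kind could serve as a penalty term, with `encode` replaced by a neural network and `W` treated as a trainable matrix; that usage is an assumption for illustration, not a description of the authors' loss.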

Theorems & Definitions (10)

  • Remark 1
  • Example 2
  • Definition 3
  • Remark 4
  • Definition 5: Equivariance
  • Definition 6
  • Definition 7
  • Definition 8: Invariance
  • Example 9
  • Remark 10