MGAug: Multimodal Geometric Augmentation in Latent Spaces of Image Deformations

Tonmoy Hossain, Miaomiao Zhang

TL;DR

This work tackles the limitation of unimodal geometric data augmentation by learning a multimodal latent space of diffeomorphic deformations through a Gaussian-mixture-augmented variational autoencoder. The method encodes multiple deformation modes in the latent space and decodes them into velocity fields that warp templates via $\boldsymbol{J} \sim p(\boldsymbol{J}|\boldsymbol{v},I,\boldsymbol{\lambda})$, yielding realistic augmentations that respect the geometry of shapes. The authors formulate an ELBO with a multimodal prior, and jointly optimize MGAug with image analysis tasks (classification on 2D data and segmentation on 3D brain MRIs), demonstrating superior performance over state-of-the-art unimodal and random augmentations, with statistical significance. This approach provides a principled mechanism to synthesize multimodal geometric variations for data augmentation, with strong potential for medical imaging and other domains with limited labeled data.

Abstract

Geometric transformations have been widely used to augment training image datasets. Existing methods often assume a unimodal distribution of the underlying transformations between images, which limits their power when the data follow multimodal distributions. In this paper, we propose a novel model, Multimodal Geometric Augmentation (MGAug), that for the first time generates augmenting transformations in a multimodal latent space of geometric deformations. To achieve this, we first develop a deep network that embeds the learning of latent geometric spaces of diffeomorphic transformations (a.k.a. diffeomorphisms) in a variational autoencoder (VAE). A mixture of multivariate Gaussians is formulated in the tangent space of diffeomorphisms and serves as a prior to approximate the hidden distribution of image transformations. We then augment the original training dataset by deforming images using transformations randomly sampled from the learned multimodal latent space of the VAE. To validate the effectiveness of our model, we jointly learn the augmentation strategy with two distinct domain-specific tasks: multi-class classification on 2D synthetic datasets and segmentation on real 3D brain magnetic resonance images (MRIs). We also compare MGAug with state-of-the-art transformation-based image augmentation algorithms. Experimental results show that our proposed approach outperforms all baselines, with significantly improved prediction accuracy. Our code is publicly available at https://github.com/tonmoy-hossain/MGAug.
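
To make the sampling step concrete, the following is a minimal sketch of drawing latent codes from a mixture of diagonal Gaussians, the kind of multimodal prior the abstract describes. It is not the authors' implementation: the function name `sample_mixture_latents`, the mixing weights, the means, and the standard deviations are all hypothetical, and the decoder that maps a latent code to a velocity field, as well as the warping of the template, are omitted.

```python
import numpy as np


def sample_mixture_latents(weights, means, stds, n_samples, rng):
    """Draw latent codes from a mixture of diagonal Gaussians (illustrative only)."""
    weights = np.asarray(weights, dtype=float)  # shape (C,), sums to 1
    means = np.asarray(means, dtype=float)      # shape (C, d)
    stds = np.asarray(stds, dtype=float)        # shape (C, d)
    # Pick a mode index for every sample according to the mixing weights.
    modes = rng.choice(len(weights), size=n_samples, p=weights)
    # Reparameterized draw: z = mu_c + sigma_c * eps, with eps ~ N(0, I).
    eps = rng.standard_normal((n_samples, means.shape[1]))
    z = means[modes] + stds[modes] * eps
    return z, modes


# Hypothetical prior with C = 3 modes in a 2-dimensional latent space.
rng = np.random.default_rng(0)
weights = [0.5, 0.3, 0.2]
means = [[-4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
stds = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
z, modes = sample_mixture_latents(weights, means, stds, 1000, rng)
```

In a pipeline of the kind described above, each row of `z` would be passed to a trained decoder to obtain a velocity field, which is then integrated into a diffeomorphic transformation and applied to an image. A unimodal prior corresponds to the special case of a single mode.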

Paper Structure (13 sections, 16 equations, 10 figures, 5 tables, 1 algorithm)

Figures (10)

  • Figure 1: Graphical representation of our proposed generative model in multimodal latent space.
  • Figure 2: An overview of our model MGAug for image analysis tasks.
  • Figure 3: Left: Accuracy comparison on various modes $C$ on 2D shape dataset. Right: A comparison of classification performance for all models over increasing number of augmented data taking all (top-right), $75\%$ (bottom-left), and $50\%$ (bottom-right) ground truth images.
  • Figure 4: Examples of determinant of Jacobian (DetJac) maps generated by MGAug. Left to right: template, augmented (deformed), DetJac overlaid with augmented images.
  • Figure 5: Top: Accuracy comparison on various modes $C$ on 2D handwritten dataset taking $20\%$ ground-truth images over $3\times$ augmentations (left panel) and classification evaluation over increasing ground-truth images taking $3\times$ augmentations (right panel). Bottom: A comparison of classification performance for all models on 2D handwritten digits over an increasing amount of augmented data taking $10\%$ ground truth images under MLP (left panel) and CNN backbone (right panel). The horizontal dashed line (- -) serves as a reference, representing the classification accuracy achieved when utilizing all available training images without any augmentation.
  • ...and 5 more figures