Long Tail Image Generation Through Feature Space Augmentation and Iterated Learning

Rafael Elberg, Denis Parra, Mircea Petrache

TL;DR

Problem: long-tailed data in medical imaging hinder reliable learning. Approach: map diffusion latent space $Z$ to a separable sparse domain $Z^s$ via Iterated Learning with sparsified embeddings (SE) and CAM-guided fusion to synthesize tail samples, using a three-stage pipeline (IL, CAM, Inference). Key results: the approach achieves fast, high-quality augmentation with limited diffusion steps $N/d$, obtaining competitive FID, but label propagation during diffusion can degrade tail-class mAP. Significance: provides an efficient, geometry-aware augmentation strategy for underrepresented classes in medical imaging, potentially reducing data collection costs and improving downstream analysis.

Abstract

Image and multimodal machine learning tasks are very challenging to solve in the case of poorly distributed data. In particular, data availability and privacy restrictions exacerbate these hurdles in the medical domain. The state of the art in image generation quality is held by Latent Diffusion models, making them prime candidates for tackling this problem. However, a few key issues still need to be solved, such as the difficulty in generating data from under-represented classes and a slow inference process. To mitigate these issues, we propose a new method for image augmentation in long-tailed data based on leveraging the rich latent space of pre-trained Stable Diffusion Models. We create a modified separable latent space to mix head and tail class examples. We build this space via Iterated Learning of underlying sparsified embeddings, which we apply to task-specific saliency maps via a K-NN approach. Code is available at https://github.com/SugarFreeManatee/Feature-Space-Augmentation-and-Iterated-Learning

Paper Structure (12 sections, 8 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Proposed method. In Stage 1 (Iterated training), we iteratively train a student network $S_i$ to imitate a frozen teacher network $T_i$ (the student network of the previous iteration, $S_{i-1}$) in mapping the original latent vectors $Z$ to a semantically separable sparse domain $Z^s$. The student is trained jointly with a classifier $C$ and a decoder $D$, which classify vectors in the sparse domain and map them back to the original domain. In Stage 2 (CAM generation), we use EigenCAM to generate class activation maps ($M_i$ for classes $i \in [1, k]$) for each vector, using the classifier trained in Stage 1. In Stage 3 (Inference), we find a nearby head-class neighbor $Z_h^s$ for each tail-class vector $Z_t^s$ and combine the two using their respective Class Activation Maps (CAMs) as masks, taking the top activations from the tail vector and the bottom activations from the head vector. Finally, we pass the combined activations through $D$ to generate a new tail-class vector.
  • Figure 2: Fusion process applied to an image from the tail class Tortuous Aorta (a.1) and one of its neighbor images from the head class Atelectasis (b.1). (a.2) and (b.2) are channel-wise Maximum Intensity Projections of the sparse vectors obtained from (a.1) and (b.1), respectively. In (a.3) and (b.3), we use EigenCAM to find attention maps for each sparse vector and define binary masks (yellow is one, dark purple is zero) using $\tau_h = \tau_l = 0.4$ as thresholds. We combine the masked sparse vectors into (c) and decode the result into a fused image (d). Finally, we apply five inference steps in (e) to obtain a less noisy image.