GRASP: Guided Residual Adapters with Sample-wise Partitioning

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

TL;DR

This work tackles mode collapse in long-tail text-to-image diffusion by introducing GRASP, a training-time method that partitions samples using external priors into semantically coherent clusters and applies cluster-specific residual adapters in a frozen diffusion backbone. By eliminating gating and minimizing intra-cluster gradient conflicts, GRASP achieves higher fidelity and diversity for tail classes across medical and natural images, with measurable improvements in downstream tasks. The contributions span architectural design (static prior-guided partitioning and residual adapters), extensive validation on medical (MIMIC-CXR-LT, NIH-CXR-LT) and general (ImageNet-LT) benchmarks, and demonstration of robustness and scalability. Overall, GRASP offers a principled, efficient route to distribution-aware fine-tuning of diffusion transformers for long-tail generation.

Abstract

Recent advances in text-to-image diffusion models enable high-fidelity generation across diverse prompts. However, these models falter in long-tail settings, such as medical imaging, where rare pathologies comprise a small fraction of data. This results in mode collapse: tail-class outputs lack quality and diversity, undermining the goal of synthetic data augmentation for underrepresented conditions. We pinpoint gradient conflicts between frequent head and rare tail classes as the primary culprit, a factor unaddressed by existing sampling or conditioning methods that mainly steer inference without altering the learned distribution. To resolve this, we propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. GRASP uses external priors to statically partition samples into clusters that minimize intra-group gradient clashes. It then fine-tunes pre-trained models by injecting cluster-specific residual adapters into transformer feedforward layers, bypassing learned gating for stability and efficiency. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes, outperforming baselines like vanilla fine-tuning and Mixture of Experts variants. Downstream classification on NIH-CXR-LT improves considerably for tail labels. Generalization to ImageNet-LT confirms broad applicability. Our method is lightweight, scalable, and readily integrates with diffusion pipelines.
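
The adapter mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It shows a frozen feedforward layer whose output is augmented by one residual adapter per cluster, where each sample is routed by a precomputed cluster index rather than by a learned gate. The class name `GraspFeedForwardSketch`, the bottleneck design with a ReLU, the zero initialization of the up-projection, and all dimensions are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


class GraspFeedForwardSketch:
    """Hypothetical sketch: frozen feedforward layer plus per-cluster residual adapters."""

    def __init__(self, dim, hidden, num_clusters, rank, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-ins for pre-trained backbone weights; these would stay frozen.
        self.w_in = rng.normal(0.0, 0.02, (dim, hidden))
        self.w_out = rng.normal(0.0, 0.02, (hidden, dim))
        # One bottleneck adapter (down, up) per cluster; only these would be trained.
        self.down = rng.normal(0.0, 0.02, (num_clusters, dim, rank))
        # Assumption: zero-initialized up-projection, so each adapter starts as a no-op.
        self.up = np.zeros((num_clusters, rank, dim))

    def base(self, x):
        # Frozen feedforward path: linear, ReLU, linear.
        return np.maximum(x @ self.w_in, 0.0) @ self.w_out

    def forward(self, x, cluster_ids):
        # x: (batch, tokens, dim); cluster_ids: (batch,) integer cluster per sample.
        down = self.down[cluster_ids]  # (batch, dim, rank)
        up = self.up[cluster_ids]      # (batch, rank, dim)
        hidden = np.maximum(np.einsum("btd,bdr->btr", x, down), 0.0)
        residual = np.einsum("btr,bre->bte", hidden, up)
        # Deterministic routing: no gate, each sample sees exactly one adapter.
        return self.base(x) + residual


layer = GraspFeedForwardSketch(dim=8, hidden=16, num_clusters=3, rank=4)
x = np.random.default_rng(1).normal(size=(4, 5, 8))
cluster_ids = np.array([0, 1, 1, 2])  # fixed before training, e.g. from an external prior
out = layer.forward(x, cluster_ids)
```

Because routing is a plain index lookup, an update to the adapter of one cluster cannot change the output for samples assigned to another cluster, which is the property the static partitioning relies on.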

Paper Structure

This paper contains 5 sections, 11 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Our proposed method uses parallel GRASP adapters to lessen the impact of gradient conflicts when training a diffusion transformer. This enables the model to generate images of higher quality and diversity, in particular for rare classes.
  • Figure 2: Overview of the GRASP architecture: a) We want to minimize gradient conflicts during training by partitioning the samples into subsets with aligned gradient directions. b) Based on this partitioning, we deterministically route samples to their designated expert, while keeping the base model frozen.
  • Figure 3: Composition of the partitioning based on labels (top) compared to the partitioning based on text clusters (bottom).
  • Figure 4: Comparison of expert specialization/resampling impact.
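
The idea in the Figure 2 caption, partitioning samples so that gradients within a subset point in similar directions, can be made concrete with a toy measurement. The sketch below assumes that two samples "conflict" when the cosine similarity of their gradients is negative, a common convention in the multi-task learning literature; the paper's own definition may differ. The function names and the toy gradients are hypothetical.

```python
import numpy as np


def pairwise_cosine(grads):
    """Cosine similarity between every pair of per-sample gradient vectors."""
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    return unit @ unit.T


def conflict_rate(grads, cluster_ids=None):
    """Fraction of sample pairs with negative gradient cosine similarity.

    If cluster_ids is given, only pairs within the same cluster are counted.
    """
    cos = pairwise_cosine(grads)
    i, j = np.triu_indices(len(grads), k=1)  # each unordered pair once
    values = cos[i, j]
    if cluster_ids is not None:
        values = values[cluster_ids[i] == cluster_ids[j]]
    return float(np.mean(values < 0)) if values.size else 0.0


# Toy gradients: two samples pull roughly in +x, two roughly in -x.
grads = np.array([[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0], [-1.0, 0.1]])
cluster_ids = np.array([0, 0, 1, 1])

overall = conflict_rate(grads)                    # 4 of the 6 pairs conflict
within_clusters = conflict_rate(grads, cluster_ids)  # neither within-cluster pair conflicts
```

In this toy setting, grouping the aligned samples removes every conflicting pair from the within-cluster count, which is the effect a good partitioning aims for.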