ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets

Hoyoung Kim, Minwoo Jang, Jabin Koo, Sangdoo Yun, Jungseul Ok

TL;DR

ChimeraLoRA addresses data scarcity in specialized and long-tailed domains by unifying class-level priors and image-specific details through a multi-head LoRA framework. It uses a shared adapter $A$ for class priors and per-image adapters $\mathcal{B}$, augmented by semantic boosting with bounding boxes, and generates images by merging heads via weights $w\sim\text{Dirichlet}(\boldsymbol{\alpha})$ to form $B'$. The method demonstrates improved downstream accuracy and a reduced synthetic-to-real gap across diverse datasets, including medical and long-tail tasks, while using fewer trainable parameters than baselines. These results suggest practical viability for few-shot learning regimes where data collection is constrained, enabling more robust and diverse synthetic datasets for training. The approach offers a principled way to balance fidelity and diversity in diffusion-model–based data augmentation and highlights opportunities for extending semantic-aware augmentation with soft labels or per-semantic sampling.

Abstract

Beyond general recognition tasks, specialized domains including privacy-constrained medical applications and fine-grained settings often encounter data scarcity, especially for tail classes. To obtain less biased and more reliable models under such scarcity, practitioners leverage diffusion models to supplement underrepresented regions of real data. Specifically, recent studies fine-tune pretrained diffusion models with LoRA on few-shot real sets to synthesize additional images. An image-wise LoRA trained on a single image captures fine-grained details but offers limited diversity, whereas a class-wise LoRA trained over all shots encodes class priors and therefore produces diverse images, but tends to overlook fine details. To combine both benefits, we separate the adapter into a class-shared LoRA $A$ for class priors and per-image LoRAs $\mathcal{B}$ for image-specific characteristics. To expose coherent class semantics in the shared LoRA $A$, we propose semantic boosting, which preserves class bounding boxes during training. For generation, we compose $A$ with a mixture of $\mathcal{B}$ using coefficients drawn from a Dirichlet distribution. Across diverse datasets, our synthesized images are both diverse and detail-rich while closely aligning with the few-shot real distribution, yielding robust gains in downstream classification accuracy.
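The generation step described in the abstract can be illustrated with a small sketch. It assumes the standard LoRA update $\Delta W = BA$ and assumes that the merged head is the weighted sum $B' = \sum_i w_i B_i$ with $w \sim \text{Dirichlet}(\boldsymbol{\alpha})$; the source states only that heads are merged with Dirichlet-sampled weights, so the exact merging rule, the toy shapes, and all function names below are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw w ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def merge_heads(heads, weights):
    """Weighted sum of equally shaped matrices (assumed rule: B' = sum_i w_i * B_i)."""
    rows, cols = len(heads[0]), len(heads[0][0])
    return [
        [sum(w * B[r][c] for w, B in zip(weights, heads)) for c in range(cols)]
        for r in range(rows)
    ]


def matmul(X, Y):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


rng = random.Random(0)

# Hypothetical toy sizes: d_out = 3, rank = 2, d_in = 4, and 3 few-shot images.
A = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]  # shared adapter, rank x d_in
heads = [
    [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]  # per-image head, d_out x rank
    for _ in range(3)
]

alpha = [1.0, 1.0, 1.0]            # assumed symmetric concentration parameters
w = sample_dirichlet(alpha, rng)   # non-negative weights that sum to 1
B_merged = merge_heads(heads, w)   # B'
delta_W = matmul(B_merged, A)      # low-rank weight update, d_out x d_in
```

Under this reading, each call to `sample_dirichlet` yields a different convex combination of the per-image heads, which is one way the sampled weights could introduce diversity while the shared $A$ stays fixed. Uniform weights would correspond to the "Avg" variant mentioned in the Figure 2 caption.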

Paper Structure (31 sections, 7 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: An overview of the proposed method. (left) We synthesize images with a multi-head LoRA that integrates the strengths of image-wise LoRA (LoFT [kim2025loft]) and class-wise LoRA (DataDream [kim2024datadream]). The blue and red regions indicate where LoRA is applied during generation. (center) Given few-shot images, we fine-tune the multi-head LoRA while preserving bounding boxes obtained from Grounded-SAM [liu2023grounding, kirillov2023segment]. (right) We merge LoRA heads using weights sampled from a Dirichlet distribution to obtain diverse synthetic images.
  • Figure 2: Qualitative results of synthetic images. (top) Four real images per class. (bottom) Synthetic images generated with LoRA-based methods. For the camera class, LoFT (image-wise LoRA) shows low diversity, with near-duplicate, single-viewpoint shots, while DataDream (class-wise LoRA) increases diversity but lowers fidelity, often failing to render a camera. Our multi-head LoRA produces accurate cameras across varied viewpoints. Here, Avg merges heads with uniform weights and Dir uses Dirichlet-sampled weights.
  • Figure 3: Robust generation with semantic boosting (SB). Without SB, a LoRA trained on a one-shot image often fails to render a car even when prompted with "a photo of a car". With SB, repeated exposure to the car region during training leads to robust generation of complete cars.
  • Figure 4: Effect of semantic boosting. (a) The input image is used to train a LoRA under varying cropping methods. (b) Without cropping, the generated images exhibit a distorted aspect ratio of the primary object. (c, d) Conventional random and center cropping produce outputs in which the object is consistently truncated. (e) In contrast, our semantic boosting preserves the object's structural integrity and details, leading to robust generation.
  • Figure 5: t-SNE for real and synthetic images. ChimeraLoRA generates mainly inside the region spanned by the real anchors marked with crosses and attains the highest coverage across methods, with Cov$(\mathcal{R};\mathcal{S}) = 0.93$ and Cov$(\mathcal{S};\mathcal{R}) = 0.90$.
  • ...and 3 more figures
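
The contrast drawn in the Figure 4 caption, where random and center crops truncate the object while semantic boosting keeps it intact, can be illustrated with a minimal sketch. It assumes that preserving the bounding box means sampling only crop windows that fully contain it; the source does not specify the authors' cropping procedure, and the function name, pixel-coordinate convention, and example sizes are hypothetical.

```python
import random


def sample_box_preserving_crop(img_w, img_h, box, rng):
    """Sample a random crop (left, top, right, bottom) that fully contains `box`.

    `box` is (x0, y0, x1, y1) in integer pixel coordinates, e.g. a class
    bounding box from a detector. This is an assumed procedure, not the
    authors' implementation.
    """
    x0, y0, x1, y1 = box
    if not (0 <= x0 < x1 <= img_w and 0 <= y0 < y1 <= img_h):
        raise ValueError("box must lie inside the image")
    left = rng.randint(0, x0)        # anywhere at or left of the box
    top = rng.randint(0, y0)         # anywhere at or above the box
    right = rng.randint(x1, img_w)   # anywhere at or right of the box
    bottom = rng.randint(y1, img_h)  # anywhere at or below the box
    return left, top, right, bottom


rng = random.Random(0)
# Hypothetical 512x384 image with the object at (100, 50)-(300, 200).
crop = sample_box_preserving_crop(512, 384, (100, 50, 300, 200), rng)
```

Because every sampled window contains the full box, the object is never truncated, while the surrounding context still varies from crop to crop.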