Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters

James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin

TL;DR

This work tackles the challenge of scaling continual diffusion to long concept sequences, where prior methods lose plasticity (the capacity to learn new concepts) as the number of tasks grows. It introduces STAMINA, which stacks low-rank adapters gated by hard-attention masks and pairs them with learnable MLP tokens, producing sparse, selective weight updates that can be folded back into the pre-trained backbone with no increase in inference cost. Empirically, STAMINA surpasses the prior state of the art on a 50-concept continual diffusion benchmark without replay data and with fewer training steps, and it also achieves state-of-the-art performance in rehearsal-free image classification on ImageNet-R. The approach demonstrates robust long-sequence continual learning for text-to-image customization and transfers to other continual learning tasks, while acknowledging ethical considerations around face generation and data provenance.

Abstract

Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential (i.e., continual) manner while only providing a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-ranked attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extended our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark.
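
The folding property described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the names (`hard_mask`, `masked_low_rank_delta`, `fold_adapters`), the thresholding of mask logits at zero, and the use of a plain low-rank product as a stand-in for the low-rank MLP that parameterizes the mask are all assumptions made for illustration, and training (including how gradients pass through the hard mask) is omitted. The sketch only shows that a stack of masked low-rank updates collapses into a single weight matrix of the original shape.

```python
import numpy as np

def hard_mask(logits):
    """Binarize mask logits: 1 where the logit is positive, else 0 (assumed rule)."""
    return (logits > 0).astype(logits.dtype)

def masked_low_rank_delta(A, B, mask_logits):
    """Sparse update: a hard mask applied elementwise to a low-rank product A @ B."""
    return hard_mask(mask_logits) * (A @ B)

def fold_adapters(W_init, adapters):
    """Fold a stack of (A, B, mask_logits) adapters into one weight matrix."""
    W = W_init.copy()
    for A, B, mask_logits in adapters:
        W += masked_low_rank_delta(A, B, mask_logits)
    return W

# Hypothetical sizes; one adapter per task in the sequence.
rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2
W_init = rng.normal(size=(d_out, d_in))

adapters = []
for _ in range(3):
    A = rng.normal(size=(d_out, rank))
    B = rng.normal(size=(rank, d_in))
    # Stand-in for the low-rank MLP that would produce the mask logits.
    mask_logits = rng.normal(size=(d_out, rank)) @ rng.normal(size=(rank, d_in))
    adapters.append((A, B, mask_logits))

W_folded = fold_adapters(W_init, adapters)
```

After folding, `W_folded` has the same shape as `W_init`, so a layer using it needs no extra parameters at inference, and entries masked out by every adapter keep their pre-trained values.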

Paper Structure (18 sections, 11 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Our work demonstrates sequentially learning long sequences of concepts. At any time, we can generate photos of any prior learned concepts, including multiple concepts together. Images denoted as "generate" in this figure are real results from our method.
  • Figure 2: Average distance from pre-trained weights, given as $||\bm{W}^{K,V}_{t}-\bm{W}^{K,V}_{init}||_2$, vs. task for C-LoRA [smith2023continual].
  • Figure 3: An overview of our approach. We learn custom tokens via MLPs operating on a fixed input. A prompt which includes the custom token is passed to the Stable Diffusion model. Our STAMINA approach modifies the key-value (K-V) projection in U-Net cross-attention modules without forgetting by using sparse, low-rank adaptations masked with MLP hard-attention. Importantly, trainable parameters, including the MLPs, can be reintegrated back into the original model backbone after training, incurring no cost to storage or inference.
  • Figure 4: Qualitative results of Continual Diffusion using celebrity faces from Celeb-A HQ [karras2017progressive, liu2015deep] and waterfalls from Google Landmarks [weyand2020google]. Results are shown for 10 samples from all 50 concepts ($\downarrow$) and are generated from the model after training on all 50 concepts. We sample for a variety of early (prone to forgetting) and late (prone to low plasticity) tasks. See \ref{appendix:source} for the source of target images.
  • Figure 5: Our multi-concept generations after training on 50 tasks.
  • ...and 4 more figures