Few-shot Learner Parameterization by Diffusion Time-steps

Zhongqi Yue, Pan Zhou, Richang Hong, Hanwang Zhang, Qianru Sun

TL;DR

Time-step Few-shot (TiF) learner significantly outperforms OpenCLIP and its adapters on a variety of fine-grained and customized few-shot learning tasks.

Abstract

Even when using large multi-modal foundation models, few-shot learning remains challenging: without a proper inductive bias, it is nearly impossible to keep the nuanced class attributes while removing the visually prominent attributes that spuriously correlate with class labels. To this end, we find an inductive bias in the time-steps of a Diffusion Model (DM), which can isolate the nuanced class attributes: as the forward diffusion adds noise to an image at each time-step, nuanced attributes are usually lost at an earlier time-step than the visually prominent spurious attributes. Building on this, we propose the Time-step Few-shot (TiF) learner. We train class-specific low-rank adapters for a text-conditioned DM to make up for the lost attributes, such that images can be accurately reconstructed from their noisy versions given a prompt. Hence, at a small time-step, the adapter and prompt are essentially a parameterization of only the nuanced class attributes. For a test image, we can use this parameterization to extract only the nuanced class attributes for classification. TiF learner significantly outperforms OpenCLIP and its adapters on a variety of fine-grained and customized few-shot learning tasks. Code is available at https://github.com/yue-zhongqi/tif.
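
The claim that nuanced attributes are lost earlier than prominent ones can be illustrated with a minimal sketch. It assumes a DDPM-style linear noise schedule (the abstract does not specify one) and uses a hypothetical signal-to-noise proxy, `first_lost_step`, which is not the paper's attribute-loss measure. All function names and default values below are illustrative assumptions.

```python
import numpy as np

def alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal coefficient for an ASSUMED linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, abar, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps

def first_lost_step(x0_a, x0_b, abar, threshold=1.0):
    """Hypothetical proxy: first t at which the scaled difference between two
    clean images falls below the per-element noise level times `threshold`."""
    diff = np.linalg.norm(x0_a - x0_b)
    snr = np.sqrt(abar) * diff / np.sqrt(1.0 - abar)
    below = np.nonzero(snr < threshold)[0]
    return int(below[0]) if below.size else len(abar)

abar = alpha_bar()
rng = np.random.default_rng(0)

base = np.zeros(64)          # toy "image" as a flat vector
nuanced = base.copy()
nuanced[0] = 0.5             # differs from base in a single element
prominent = base + 0.5       # differs from base in every element

x_t = forward_diffuse(base, 100, abar, rng)  # one noisy sample at t = 100
print("nuanced difference lost at t =", first_lost_step(base, nuanced, abar))
print("prominent difference lost at t =", first_lost_step(base, prominent, abar))
```

Under these assumptions, the small (single-element) difference drops below the noise level at an earlier time-step than the large one, which mirrors the qualitative behavior described in the abstract.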

Paper Structure (15 sections, 10 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: (a) GradCAM [selvaraju2017grad] of Tip-Adapter [tipadapter] on a 4-shot learning task from FGVCAircraft [fgvc], where it is biased toward the spurious background. (b) Comparison of few-shot learning performance. Our DM-based method significantly outperforms zero-shot OpenCLIP [ilharco_gabriel_2021_5143773] (ViT-H/14 trained on LAION-2B [schuhmann2022laionb]) and its adapter.
  • Figure 2: Top: DM forward process with attribute loss examples. Bottom: Attention map of what $(y,t,\theta_c)$ parameterizes for $c=$A or B, which includes only nuances at a small $t$, and expands as $t$ increases when more attributes are lost. We follow [mokady2023null] to compute the average attention over a small time-step range indicated by the blue line. Details in Appendix. Red: Our proposed hyper-parameter-free weights for all time-steps.
  • Figure 3: Plot of $\mathrm{Err}(\mathbf{x}_0,\mathbf{x}'_0,t)$ on 4 pairs of $(\mathbf{x}_0,\mathbf{x}'_0)$ with different pixel-level differences. We observe that the attribute loss for each pair is strictly increasing in $t$, and that the fine-grained attributes distinguishing a more similar image pair are lost earlier.
  • Figure 4: Overall pipeline of TiF learner. (a) Green: SD U-Net with each attention block illustrated by a rectangle. Red arrows: We inject trainable LoRA matrices $\theta_c$ into the attention blocks of the U-Net and text encoder. Solid lines: always injected; dotted lines: optional (studied in ablation). We train $\theta_c$ to reconstruct $\mathbf{x}_0$ from $\mathbf{x}_t$ by $\mathcal{L}_t$. (b) Our inference rule, which computes a weighted average of $\mathcal{L}_t$ over time-steps.
  • Figure 5: Comparison of synthesized images with two LoRA ranks. LoRA overfits to irrelevant details when rank is too high (top, rank 16), or fails to capture the nuances accurately when rank is too low (bottom, rank 8).
  • ...and 1 more figure
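
The inference rule in the Figure 4(b) caption, a weighted average of $\mathcal{L}_t$ over time-steps, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_class` is a hypothetical helper, the loss table is assumed to hold one reconstruction loss per class adapter and time-step, and the weights are placeholders because the paper's hyper-parameter-free weights are not given in this section.

```python
import numpy as np

def predict_class(losses, weights):
    """Pick the class whose adapter gives the lowest weighted-average loss.

    losses:  array of shape (num_classes, num_steps); losses[c, t] is the
             reconstruction loss at time-step t using class c's adapter
             (ASSUMED to be precomputed elsewhere).
    weights: array of shape (num_steps,), non-negative placeholder weights.
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if losses.ndim != 2 or weights.shape != (losses.shape[1],):
        raise ValueError("expected losses (C, T) and weights (T,)")
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # Normalize so the score is a weighted average rather than a weighted sum.
    scores = losses @ (weights / weights.sum())
    return int(np.argmin(scores)), scores

# Toy table: 2 classes, 2 time-steps (small t first). Values are made up.
toy_losses = [[0.10, 0.50],
              [0.20, 0.30]]
pred, scores = predict_class(toy_losses, weights=[0.9, 0.1])
print(pred, scores)
```

In this toy table, weights that emphasize the small time-step select class 0, whereas uniform weights would select class 1, so the choice of weights can change the prediction.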