Tiny models from tiny data: Textual and null-text inversion for few-shot distillation

Erik Landolsi, Fredrik Kahl

TL;DR

This work tackles few-shot classification with tiny models by leveraging diffusion-model inversion to synthesize task-relevant data from a handful of examples. The proposed Textual + Null-Text Inversion (TINT) combines Textual Inversion and Null-Text Inversion to generate diverse, class-consistent images, which are then used to distill knowledge from a strong teacher into a compact student. A theoretical analysis of estimator variance guides efficient multi-episode evaluation, enabling practical benchmarking. Empirically, TINT achieves state-of-the-art or competitive accuracy for small models on miniImageNet, CUB, and CIFAR-FS, while offering orders-of-magnitude faster data generation than prior generative distillation methods, illustrating the practical impact of generative data for tiny models.

Abstract

Few-shot learning deals with problems such as image classification using very few training examples. Recent vision foundation models show excellent few-shot transfer abilities, but are large and slow at inference. Using knowledge distillation, the capabilities of high-performing but slow models can be transferred to tiny, efficient models. However, common distillation methods require a large set of unlabeled data, which is not available in the few-shot setting. To overcome this lack of data, there has been recent interest in using synthetic data. We expand on this line of research by presenting a novel diffusion model inversion technique (TINT) that combines the diversity of textual inversion with the specificity of null-text inversion. Using this method in a few-shot distillation pipeline leads to state-of-the-art accuracy among small student models on popular benchmarks, while being significantly faster than prior work. Popular few-shot benchmarks involve evaluation over a large number of episodes, which is computationally cumbersome for methods involving synthetic data generation. We therefore also present a theoretical analysis of how the variance of the accuracy estimator depends on the number of episodes and query examples, and use these results to lower the computational effort required for method evaluation. Finally, to further motivate the use of generative models in few-shot distillation, we demonstrate that our method outperforms training on real data mined from the dataset used in the original diffusion model training. Source code is available at https://github.com/pixwse/tiny2.
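
To make the distillation step concrete, here is a minimal sketch of a generic temperature-scaled distillation loss, the standard formulation in which a student is trained to match a teacher's softened class probabilities. It is not taken from the paper: the function names, the temperature value, and the example logits are all illustrative assumptions, and the authors' actual loss may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic knowledge-distillation loss (an assumption for illustration,
    not the paper's exact objective).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical logits for one (possibly synthetic) image in a 5-way episode.
teacher = [4.0, 1.0, 0.5, -1.0, -2.0]
matching_student = [4.0, 1.0, 0.5, -1.0, -2.0]
confused_student = [0.0, 3.0, 0.5, -1.0, -2.0]

loss_match = distillation_loss(teacher, matching_student)
loss_confused = distillation_loss(teacher, confused_student)
print(f"matching student: {loss_match:.4f}")
print(f"confused student: {loss_confused:.4f}")
```

The loss needs only teacher outputs, not ground-truth labels, which is why unlabeled (here, synthetic) images are enough to train the student.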

Paper Structure

This paper contains 31 sections, 1 theorem, 10 equations, 10 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Let $P, Q \in \mathbb{Z}^{+}$ be the number of episodes and query examples per episode, respectively. Let $a_p \in [0, 1]$ be the true (unknown) accuracy of an evaluated method on episode $p \in \{1, \ldots, P\}$ and let $a = \mathbb{E}[a_p]$ and $\sigma_a^2 = \mathrm{Var}[a_p]$ be the true (unknown) me
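
The statement above is cut off, so the paper's exact result is not reproduced here. As a hedged illustration of the kind of dependence on $P$ and $Q$ it concerns, the sketch below makes two assumptions that are ours rather than the source's: per-episode accuracies $a_p$ are drawn i.i.d., and each query in episode $p$ is classified correctly independently with probability $a_p$. Under those assumptions, the law of total variance gives $\mathrm{Var}[\hat{a}] = \frac{1}{P}\left(\sigma_a^2 + \frac{a(1-a) - \sigma_a^2}{Q}\right)$ for the pooled accuracy estimate $\hat{a}$. The Beta distribution over $a_p$ and all function names are illustrative choices.

```python
import random
import statistics


def estimator_variance(mean_acc, var_acc, num_episodes, num_queries):
    """Variance of the pooled accuracy estimate under the stated assumptions.

    Between-episode term: var_acc.
    Within-episode term: E[a_p (1 - a_p)] / Q = (a (1 - a) - var_acc) / Q.
    """
    within = (mean_acc * (1.0 - mean_acc) - var_acc) / num_queries
    return (var_acc + within) / num_episodes


def simulate_benchmark(rng, num_episodes, num_queries, alpha, beta):
    """One benchmark run: average accuracy over P episodes with Q queries each."""
    correct = 0
    for _ in range(num_episodes):
        a_p = rng.betavariate(alpha, beta)  # assumed episode-accuracy model
        correct += sum(rng.random() < a_p for _ in range(num_queries))
    return correct / (num_episodes * num_queries)


# Illustrative Beta(8, 2) accuracies: mean 0.8, variance about 0.0145.
ALPHA, BETA = 8.0, 2.0
MEAN_ACC = ALPHA / (ALPHA + BETA)
VAR_ACC = ALPHA * BETA / ((ALPHA + BETA) ** 2 * (ALPHA + BETA + 1.0))

rng = random.Random(0)
P, Q = 20, 15
runs = [simulate_benchmark(rng, P, Q, ALPHA, BETA) for _ in range(2000)]
empirical = statistics.variance(runs)
predicted = estimator_variance(MEAN_ACC, VAR_ACC, P, Q)
print(f"empirical variance: {empirical:.6f}")
print(f"predicted variance: {predicted:.6f}")
```

In this simplified model, increasing $Q$ cannot push the variance below $\sigma_a^2 / P$, whereas increasing $P$ shrinks both terms. That is one way to reason about trading episodes against queries when each episode is expensive to evaluate.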

Figures (10)

  • Figure 1: Overview of the TINT method. Left: From a set of input examples, we optimize all external quantities (latents and conditional/unconditional embeddings). Right: A new image is generated by blending a latent with noise and feeding it through the diffusion model.
  • Figure 2: Examples of images generated by our method (bottom) from randomly selected support images (top). Left: Class Scoreboard from miniImageNet. Right: Class Caspian Tern from CUB.
  • Figure 3: Generated images with increasing $\alpha$ (latent space noise), showing how the method gradually transitions from a more augmentation-like behavior to full synthetic image generation as $\alpha$ increases.
  • Figure 4: Overview of our few-shot transfer pipeline. First, the TINT generator and teacher are specialized on the novel classes (a), and then a distillation procedure is run to transfer the knowledge from the teacher to the student (b). The modules adjusted in each step are outlined in red.
  • Figure 5: More qualitative examples of support (top) and generated (bottom) examples for two classes from miniImageNet. Left: guitar, right: roundworm (failure example).
  • ...and 5 more figures
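
The captions for Figures 1 and 3 describe blending an optimized latent with noise, controlled by $\alpha$. The sketch below shows one plausible form of such a blend, a variance-preserving mix $\sqrt{1-\alpha}\,z + \sqrt{\alpha}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. This formula is an assumption made for illustration, not the paper's confirmed rule, and `blend_latent` is a hypothetical helper that operates on a flat list rather than a real diffusion latent tensor.

```python
import math
import random
import statistics


def blend_latent(latent, alpha, rng):
    """Mix a latent with fresh Gaussian noise (assumed variance-preserving form).

    alpha = 0 returns the latent unchanged (augmentation-like end);
    alpha = 1 returns pure noise (fully synthetic end).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    keep = math.sqrt(1.0 - alpha)
    mix = math.sqrt(alpha)
    return [keep * z + mix * rng.gauss(0.0, 1.0) for z in latent]


def correlation(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)
# Stand-in for a latent recovered by inversion from one support image.
latent = [rng.gauss(0.0, 1.0) for _ in range(4096)]

for alpha in (0.0, 0.5, 1.0):
    blended = blend_latent(latent, alpha, rng)
    print(f"alpha={alpha:.1f}  "
          f"variance={statistics.pvariance(blended):.3f}  "
          f"correlation with source={correlation(latent, blended):.3f}")
```

Under this assumed blend, the correlation with the source latent falls as $\alpha$ grows while the overall variance stays roughly constant. This matches the qualitative transition described for Figure 3, from augmentation-like outputs to fully synthetic ones.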
