Data Extrapolation for Text-to-image Generation on Small Datasets

Senmao Ye, Fei Liu

TL;DR

This work tackles data scarcity in text-to-image generation by coupling linear extrapolation in the text feature space with web-image retrieval and outlier filtering by two detectors. It introduces a NULL-guidance mechanism to refine score estimation and uses recurrent affine transformation (RAT) to fuse text information into the diffusion model. Results on CUB, Oxford, and COCO show competitive or superior FID/IS scores with far less pretraining data, and ablations examine the contributions of outlier detection, extrapolation quantity, and text-injection strategies. The approach improves data efficiency for diffusion-based text-to-image generation when data is limited, with potential applicability to other modalities and tasks. Specifically, the method uses $\mathbf{w}=(\mathbf{F}^T\mathbf{F})^{-1}\mathbf{F}^T\mathbf{f}$ and $\mathbf{s}=\mathbf{S}\mathbf{w}$ for text extrapolation, the NULL-score fusion $\epsilon'=(\epsilon_{text}-\epsilon_{null}) \times \eta + \epsilon_{null}$, and the latent-transform conditioning $c'=\text{Transformer}((1+\gamma)\cdot c+\beta)\cdot\alpha$.
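As a rough illustration of the text-extrapolation equations above, the following NumPy sketch solves for $\mathbf{w}$ by least squares and then applies it as $\mathbf{s}=\mathbf{S}\mathbf{w}$. The shapes, the random data, the function names, and the reading of $\mathbf{F}$ as columns of known text features, $\mathbf{f}$ as the new text feature, and $\mathbf{S}$ as the matching image-side quantities are assumptions made for illustration; none of them is taken from the authors' code.

```python
import numpy as np


def extrapolation_weights(F, f):
    """Least-squares solution of F w ~= f.

    Equals (F^T F)^{-1} F^T f when F has full column rank.
    """
    w, *_ = np.linalg.lstsq(F, f, rcond=None)
    return w


def apply_weights(S, w):
    """Apply the same coefficients on the other side: s = S w."""
    return S @ w


# Toy data (assumed shapes): d-dimensional features, k reference samples.
rng = np.random.default_rng(0)
d, k = 16, 4
F = rng.normal(size=(d, k))          # columns: features of k known texts

# Coefficients with negative entries place f outside the convex hull
# of the columns of F, which is what distinguishes extrapolation
# from interpolation.
w_true = np.array([0.9, 0.6, -0.3, -0.2])
f = F @ w_true                       # feature of the "new" text

w = extrapolation_weights(F, f)
S = rng.normal(size=(8, k))          # hypothetical image-side matrix
s = apply_weights(S, w)
```

Using `np.linalg.lstsq` rather than forming $(\mathbf{F}^T\mathbf{F})^{-1}$ explicitly gives the same result for a well-conditioned $\mathbf{F}$ and is numerically safer when the columns are nearly collinear.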

Abstract

Text-to-image generation requires large amounts of training data to synthesize high-quality images. To augment training data, previous methods rely on data interpolations such as cropping, flipping, and mixing up, which fail to introduce new information and yield only marginal improvements. In this paper, we propose a new data augmentation method for text-to-image generation using linear extrapolation. Specifically, we apply linear extrapolation only to text features, and new image data are retrieved from the internet by search engines. To ensure the reliability of new text-image pairs, we design two outlier detectors to purify retrieved images. Based on extrapolation, we construct training samples dozens of times larger than the original dataset, resulting in a significant improvement in text-to-image performance. Moreover, we propose NULL-guidance to refine score estimation, and apply recurrent affine transformation to fuse text information. Our model achieves FID scores of 7.91, 9.52 and 5.00 on the CUB, Oxford and COCO datasets, respectively. The code and data will be available on GitHub (https://github.com/senmaoy/RAT-Diffusion).
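The NULL-guidance fusion $\epsilon'=(\epsilon_{text}-\epsilon_{null}) \times \eta + \epsilon_{null}$ can be sketched in a few lines of NumPy. This is a minimal sketch of the formula only: the function name, the array shapes, and the value of $\eta$ are illustrative assumptions, and the two noise predictions are random stand-ins rather than outputs of a real denoiser.

```python
import numpy as np


def null_guidance(eps_text, eps_null, eta):
    """Fuse two noise predictions: eps' = (eps_text - eps_null) * eta + eps_null."""
    return (eps_text - eps_null) * eta + eps_null


# Stand-in predictions (assumed latent shape: channels x height x width).
rng = np.random.default_rng(0)
eps_text = rng.normal(size=(4, 8, 8))   # prediction conditioned on the text
eps_null = rng.normal(size=(4, 8, 8))   # prediction conditioned on the NULL input

eta = 1.5                               # illustrative guidance strength
eps_fused = null_guidance(eps_text, eps_null, eta)
```

By construction, $\eta=0$ returns the NULL prediction, $\eta=1$ returns the text-conditioned prediction, and $\eta>1$ moves past the text-conditioned prediction along the direction $\epsilon_{text}-\epsilon_{null}$.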

Paper Structure (30 sections, 8 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: An illustration of linear data extrapolation. We use a search engine and outlier detectors to ensure image similarity. Extrapolation produces many more text-image pairs than the original dataset contains.
  • Figure 2: Latent diffusion model with recurrent affine transformation and NULL-guidance for text-to-image synthesis. The RAT blocks are connected by a recurrent neural network to ensure the global assignment of text information.
  • Figure 3: Qualitative comparison on the CUB and Oxford datasets. The input text descriptions are given in the first row, and the corresponding images generated by different methods are shown in the same column. Best viewed in color and zoomed in.
  • Figure 4: Qualitative comparison of our model with RAT-GAN on the COCO dataset.
  • Figure 5: Randomly generated images from the Oxford dataset. Best viewed in color and zoomed in.
  • ...and 3 more figures