Data Extrapolation for Text-to-image Generation on Small Datasets
Senmao Ye, Fei Liu
TL;DR
This work tackles data scarcity in text-to-image generation by coupling linear extrapolation in the text feature space with web-image retrieval and outlier filtering. It introduces a NULL-guidance mechanism and a recurrent affine transformation (RAT) to fuse textual cues while remaining reliable on small datasets. Empirical results on CUB, Oxford, and COCO show competitive or superior FID/IS scores with far less pretraining data, and ablations examine the contributions of outlier detection, extrapolation quantity, and text-injection strategies. The approach improves data efficiency and offers a scalable framework for diffusion-based T2I when data is limited, with potential applicability to other modalities and tasks. Specifically, the method uses $\mathbf{w}=(\mathbf{F}^T\mathbf{F})^{-1}\mathbf{F}^T\mathbf{f}$ and $\mathbf{s}=\mathbf{S}\mathbf{w}$ for text extrapolation, NULL-score fusion $\epsilon'=(\epsilon_{text}-\epsilon_{null}) \times \eta + \epsilon_{null}$, and latent-transform conditioning $c'=\text{Transformer}((1+\gamma)\cdot c+\beta)\cdot\alpha$.
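The two extrapolation equations can be read as an ordinary least-squares fit followed by a linear map. The sketch below is illustrative only and is not taken from the authors' code. It assumes that $\mathbf{F}$ stores one reference feature per column, that $\mathbf{f}$ is a single query feature, and that $\mathbf{S}$ stores a paired feature for each column of $\mathbf{F}$; the summary does not define these roles, and the function name `extrapolate` and the toy dimensions are hypothetical.

```python
import numpy as np

def extrapolate(F, S, f):
    """Solve w = (F^T F)^{-1} F^T f, then return s = S w.

    Assumed shapes (not specified in the summary):
      F: (d_f, n) reference features, one per column
      S: (d_s, n) paired features, one per column
      f: (d_f,)   query feature
    """
    # lstsq returns the least-squares solution, which equals the
    # normal-equation form when F has full column rank.
    w, *_ = np.linalg.lstsq(F, f, rcond=None)
    s = S @ w
    return w, s

# Toy data: a query that is an exact linear combination of the columns of F.
rng = np.random.default_rng(0)
F = rng.normal(size=(16, 4))
S = rng.normal(size=(8, 4))
w_true = np.array([0.5, -1.0, 2.0, 0.25])
f = F @ w_true

w, s = extrapolate(F, S, f)
w_normal = np.linalg.solve(F.T @ F, F.T @ f)  # explicit normal equations
```

Using `np.linalg.lstsq` instead of forming $(\mathbf{F}^T\mathbf{F})^{-1}$ explicitly is a numerical-stability choice made for this sketch; both give the same weights when $\mathbf{F}$ has full column rank.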
Abstract
Text-to-image generation requires a large amount of training data to synthesize high-quality images. To augment training data, previous methods rely on data interpolations such as cropping, flipping, and mixing up, which fail to introduce new information and yield only marginal improvements. In this paper, we propose a new data augmentation method for text-to-image generation using linear extrapolation. Specifically, we apply linear extrapolation only to text features, and new image data are retrieved from the internet by search engines. To ensure the reliability of the new text-image pairs, we design two outlier detectors to purify the retrieved images. Based on extrapolation, we construct a training set dozens of times larger than the original dataset, resulting in a significant improvement in text-to-image performance. Moreover, we propose NULL-guidance to refine score estimation and apply recurrent affine transformation to fuse text information. Our model achieves FID scores of 7.91, 9.52 and 5.00 on the CUB, Oxford and COCO datasets, respectively. The code and data will be available on GitHub (https://github.com/senmaoy/RAT-Diffusion).
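As a rough illustration of the NULL-guidance idea, the fusion rule $\epsilon'=(\epsilon_{text}-\epsilon_{null}) \times \eta + \epsilon_{null}$ can be written as a one-line function. This is a minimal sketch, not the authors' implementation: the function name `null_guidance` is hypothetical, the two noise estimates are plain arrays here, and how $\epsilon_{null}$ is obtained from the model is not covered.

```python
import numpy as np

def null_guidance(eps_text, eps_null, eta):
    """Fuse two noise estimates: eps' = (eps_text - eps_null) * eta + eps_null."""
    return (eps_text - eps_null) * eta + eps_null

# Toy values standing in for model outputs (assumed, for illustration only).
eps_text = np.array([1.0, 2.0])
eps_null = np.array([0.0, 1.0])

fused = null_guidance(eps_text, eps_null, eta=2.0)
```

Under this formula, $\eta = 1$ returns the text-conditioned estimate unchanged, $\eta = 0$ returns the NULL estimate, and $\eta > 1$ moves the result further along the direction from the NULL estimate toward the text-conditioned one.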
