Text-to-Image Cross-Modal Generation: A Systematic Review

Maciej Żelaszczyk, Jacek Mańdziuk

TL;DR

This survey addresses the problem of generating visual data from text through cross-modal generation, organizing the literature into coherent templates and tasks. It surveys vanilla and extended approaches across VAE, GAN, diffusion, and Transformer paradigms, as well as iterative generation, self-supervised methods, video, editing, and graph-based techniques. Key contributions include mapping common architectural templates, examining performance trends, and identifying gaps such as data availability, multilinguality, and evaluation standardization. The findings highlight diffusion as the current leading paradigm for high-quality text-to-image synthesis, while also underscoring the importance of cross-modal conditioning, hierarchical generation, and potential future directions like knowledge-grounded and non-paired data methods with broader practical impact.

Abstract

We review research on generating visual data from text from the angle of "cross-modal generation." This point of view allows us to draw parallels between various methods geared towards working on input text and producing visual output, without limiting the analysis to narrow sub-areas. It also results in the identification of common templates in the field, which are then compared and contrasted both within pools of similar methods and across lines of research. We provide a breakdown of text-to-image generation into various flavors of image-from-text methods, video-from-text methods, image editing, self-supervised and graph-based approaches. In this discussion, we focus on research papers published at 8 leading machine learning conferences in the years 2016-2022, also incorporating a number of relevant papers not matching the outlined search criteria. The conducted review suggests a significant increase in the number of papers published in the area and highlights research gaps and potential lines of investigation. To our knowledge, this is the first review to systematically look at text-to-image generation from the perspective of "cross-modal generation."

Paper Structure (16 sections, 15 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Conditioning image generation on additional data. An image is processed by an image encoder and the text describing the characteristics of the desired generated image is processed by a text encoder. The features obtained from both encoders are fused in a joint feature representation, which is passed along to the image decoder to produce the actual image.
  • Figure 2: Cross-modal text-to-image generation research areas. These research areas are identified with respect to the architectures, training methods or tasks performed. The specific methods within each area can be significantly different, depending on concrete problems handled.
  • Figure 3: Standard image-from-text VAE template. The text describing the desired image is processed by an RNN encoder in order to obtain parameters $\mu$ and $\sigma$, which are used to sample the feature representation $\mathbf{f}$ of the image to be produced. This representation is used by a TCNN decoder to produce the actual image. A reconstruction loss is used in training to ensure the fidelity of the generated images to the dataset used for training, while a KL divergence term is used to enforce the alignment of $\mu$ and $\sigma$ with the parameters of a predetermined distribution.
  • Figure 4: Standard image-from-text GAN template. The text description of the desired image is processed by an RNN encoder to obtain the text features $\mathbf{f_{t}}$. A random component $\mathbf{z}$ is sampled separately from a predetermined distribution. In the generator part of the architecture, $\mathbf{f_{t}}$ and $\mathbf{z}$ are combined into a joint representation $\mathbf{f}$, which is fed to a TCNN decoder to produce the image. The discriminator part of the architecture handles the images produced this way, as well as real images from the dataset. An image is processed by a CNN encoder to arrive at the image features $\mathbf{f_{i}}$. These image features, together with the text features $\mathbf{f_{t}}$, are fed into a binary classifier that aims to distinguish whether the processed image is real (comes from the dataset) or fake (has been produced by the generator). During training, the generator and discriminator compete to push the objective function in opposite directions.
  • Figure 5: Standard image-from-text diffusion template. Training: An image is processed by an image encoder to obtain features $\mathbf{f}$. These features are subjected to a forward diffusion process in which noise is gradually added to them, resulting in the final noisy representation $\mathbf{f_{T}}$, which is passed to the denoising network along with the conditioning information from the RNN encoder. The denoising network progressively removes the noise from the representation to arrive at the reconstructed representation $\mathbf{f^{*}}$, which is processed by the image decoder to produce the reconstructed image. The optimization procedure aims to reconstruct the input image. Inference: Instead of $\mathbf{f_{T}}$ obtained through forward diffusion, randomly sampled noise is used as input to the denoising network, along with the output of the RNN text encoder.
  • ...and 5 more figures
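
The sampling step and the KL divergence term described in the Figure 3 caption can be illustrated with a minimal sketch. It rests on two assumptions that the caption does not state: the predetermined distribution is taken to be a standard normal, and sampling uses the reparameterization $\mathbf{f} = \mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. The function names `sample_features` and `kl_to_standard_normal` are hypothetical, and the RNN encoder and TCNN decoder are omitted, so `mu` and `sigma` are plain lists of numbers here.

```python
import math
import random


def sample_features(mu, sigma, rng):
    """Draw f = mu + sigma * eps with eps ~ N(0, 1), one value per dimension.

    Assumption: a diagonal Gaussian, sampled via the reparameterization trick.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Assumption: the "predetermined distribution" is a standard normal.
    Per dimension: 0.5 * (mu^2 + sigma^2 - 1) - ln(sigma), with sigma > 0.
    """
    return sum(
        0.5 * (m * m + s * s - 1.0) - math.log(s)
        for m, s in zip(mu, sigma)
    )


# Illustrative values standing in for the text encoder's outputs.
rng = random.Random(0)
mu = [0.5, -1.0, 0.0]
sigma = [1.0, 0.5, 2.0]

f = sample_features(mu, sigma, rng)      # feature representation fed to the decoder
kl = kl_to_standard_normal(mu, sigma)    # regularization term added to the reconstruction loss
```

The KL term is zero exactly when `mu` is all zeros and `sigma` is all ones, which is the sense in which it pulls the encoder's parameters toward those of the predetermined distribution.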