Text-to-Image Cross-Modal Generation: A Systematic Review
Maciej Żelaszczyk, Jacek Mańdziuk
TL;DR
This survey examines the generation of visual data from text through the lens of cross-modal generation, organizing the literature into common templates and tasks. It covers vanilla and extended approaches across the VAE, GAN, diffusion, and Transformer paradigms, as well as iterative generation, self-supervised methods, video generation, image editing, and graph-based techniques. Key contributions include mapping common architectural templates, examining performance trends, and identifying gaps such as data availability, multilinguality, and evaluation standardization. The findings highlight diffusion as the current leading paradigm for high-quality text-to-image synthesis and underscore the importance of cross-modal conditioning and hierarchical generation. Potential future directions include knowledge-grounded methods and methods that learn from non-paired data, both of which could broaden practical impact.
Abstract
We review research on generating visual data from text from the angle of "cross-modal generation." This point of view allows us to draw parallels between various methods that take text as input and produce visual output, without limiting the analysis to narrow sub-areas. It also lets us identify common templates in the field, which we then compare and contrast both within pools of similar methods and across lines of research. We break text-to-image generation down into various flavors of image-from-text methods, video-from-text methods, image editing, and self-supervised and graph-based approaches. In this discussion, we focus on research papers published at 8 leading machine learning conferences in the years 2016-2022, and also incorporate a number of relevant papers that do not match the outlined search criteria. The review suggests a significant increase in the number of papers published in the area and highlights research gaps and potential lines of investigation. To our knowledge, this is the first review to systematically look at text-to-image generation from the perspective of "cross-modal generation."
