Survey: Transformer-based Models in Data Modality Conversion
Elyas Rashno, Amir Eskandari, Aman Anand, Farhana Zulkernine
TL;DR
This survey addresses the lack of a systematic review of transformer-based data modality conversion across text, vision, and speech. It synthesizes architectural patterns and conversion methodologies, cataloging encoder-only, decoder-only, and encoder–decoder NLP models alongside transformer-based vision and speech architectures. It covers cross-modal tasks such as text-to-image, image-to-text, and speech-to-text, and discusses downstream applications including translation, speech recognition, captioning, and named entity recognition (NER), with attention to efficiency and scalability. It closes with future directions in efficiency, multimodal integration, scalability, real-time deployment, and ethical AI, and argues that transformers have broad potential for cross-modal content generation and understanding.
Abstract
Transformers have made significant strides across various artificial intelligence domains, including natural language processing, computer vision, and audio processing. This success has garnered considerable interest from both academic and industry researchers, and numerous Transformer variants (often referred to as X-formers) have been developed for these fields. However, a thorough and systematic review of conversion between these modalities remains lacking. Modality conversion is the transformation of data from one form of representation to another, mimicking the way humans integrate and interpret sensory information. This paper provides a comprehensive review of transformer-based models applied to the primary modalities of text, vision, and speech, discussing their architectures, conversion methodologies, and applications. By synthesizing the literature on modality conversion, this survey aims to underline the versatility and scalability of transformers in advancing AI-driven content generation and understanding.
