Survey: Transformer-based Models in Data Modality Conversion

Elyas Rashno, Amir Eskandari, Aman Anand, Farhana Zulkernine

TL;DR

This survey addresses the lack of systematic coverage of transformer-based (TB) data modality conversion across text, vision, and speech. It synthesizes architectural patterns and conversion methodologies, cataloging encoder-only, decoder-only, and encoder–decoder NLP models alongside TB architectures for vision and speech. The review highlights cross-modal tasks such as text-to-image, image-to-text, and speech-to-text, and discusses downstream applications including translation, speech recognition (SR), captioning, and named entity recognition (NER), with attention to efficiency and scalability. It emphasizes future directions in efficiency, multimodal integration, scalability, real-time deployment, and ethical AI, illustrating the potential of transformers in cross-modal content generation and understanding.

Abstract

Transformers have made significant strides across various artificial intelligence domains, including natural language processing, computer vision, and audio processing. This success has garnered considerable interest from both academic and industry researchers, and numerous Transformer variants (often referred to as X-formers) have been developed for these fields. However, a thorough and systematic review of transformer-based modality conversion remains lacking. Modality conversion is the transformation of data from one form of representation to another, mimicking the way humans integrate and interpret sensory information. This paper provides a comprehensive review of transformer-based models applied to the primary modalities of text, vision, and speech, discussing their architectures, conversion methodologies, and applications. By synthesizing the literature on modality conversion, this survey aims to underline the versatility and scalability of transformers in advancing AI-driven content generation and understanding.

Paper Structure

This paper contains 52 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of the paper's structure, featuring three modalities: text (blue), vision (red), and speech (green). Each section introduces one modality, identified by its color and name. It covers well-known TBs and their primary applications. The right-hand boxes for each modality illustrate conversion processes and notable applications.
  • Figure 2: Overall structure of the paper. Related surveys are introduced in the second section, the basic (vanilla) transformer model is detailed in the third section, and the last three sections cover methods for text, vision, and speech processing, respectively.
  • Figure 3: Recent surveys related to text, vision, speech, and multi-modality.
  • Figure 4: Structure of the vanilla Transformer (Vaswani et al., 2017).
  • Figure 5: Dual-encoder and cross-attention models in text-to-vision approaches.
  • ...and 4 more figures