The Evolution of Multimodal Model Architectures

Shakti N. Wadekar; Abhishek Chaurasia; Aman Chadha; Eugenio Culurciello

TL;DR

The paper tackles the challenge of tracking rapid advances in multimodal architectures by proposing a four-type taxonomy (Type-A to Type-D) that separates deep fusion from early fusion and tokenized from non-tokenized inputs. It systematically maps state-of-the-art models to these types, detailing architectural patterns, sub-types, training regimes, and data needs. The authors compare advantages and disadvantages across types, discuss next-generation directions (notably any-to-any capabilities via Type-C/Type-D and non-end-to-end approaches with agents), and highlight open design trade-offs. The result is a practical framework to guide model selection, development, and monitoring of the evolving multimodal landscape. Overall, the taxonomy provides a concise lens for understanding how modalities are fused and scaled in contemporary models and where future work is likely to be most impactful.

Abstract

This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensive exploration of architectural details and identifies four specific architectural types. The types are distinguished by their respective methodologies for integrating multimodal inputs into the deep neural network model. The first two types (Type-A and Type-B) deeply fuse multimodal inputs within the internal layers of the model, whereas the following two types (Type-C and Type-D) facilitate early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B utilizes custom-designed layers for modality fusion within the internal layers. On the other hand, Type-C utilizes modality-specific encoders, while Type-D leverages tokenizers to process the modalities at the model's input stage. The identified architecture types aid the monitoring of any-to-any multimodal model development. Notably, Type-C and Type-D are currently favored in the construction of any-to-any multimodal models. Type-C, distinguished by its non-tokenizing multimodal model architecture, is emerging as a viable alternative to Type-D, which utilizes input-tokenizing techniques. To assist in model selection, this work highlights the advantages and disadvantages of each architecture type based on data and compute requirements, architecture complexity, scalability, ease of adding modalities, training objectives, and any-to-any multimodal generation capability.
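As a rough illustration of how the four types are told apart, the two distinguishing questions in the abstract (where fusion happens, and by what mechanism) can be written as a small lookup. This is only a sketch: the class name `FusionDesign`, the function `classify`, and the string labels are hypothetical names chosen for this example and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionDesign:
    """Hypothetical description of how a model integrates modality inputs."""
    fusion_stage: str  # "internal_layers" (deep fusion) or "input" (early fusion)
    mechanism: str     # see the keys of _TAXONOMY below


# Mapping from (fusion stage, mechanism) to the architecture type named in the abstract.
_TAXONOMY = {
    ("internal_layers", "standard_cross_attention"): "Type-A (SCDF)",
    ("internal_layers", "custom_layer"): "Type-B (CLDF)",
    ("input", "modality_encoder"): "Type-C (NTEF)",
    ("input", "tokenizer"): "Type-D (TEF)",
}


def classify(design: FusionDesign) -> str:
    """Return the taxonomy type for a design, or raise if the combination is unknown."""
    key = (design.fusion_stage, design.mechanism)
    if key not in _TAXONOMY:
        raise ValueError(f"combination not covered by the taxonomy: {key}")
    return _TAXONOMY[key]


if __name__ == "__main__":
    print(classify(FusionDesign("input", "modality_encoder")))
```

The lookup deliberately rejects combinations the abstract does not describe, such as a tokenizer paired with deep fusion, rather than guessing a type for them.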

Paper Structure (25 sections, 7 figures, 6 tables)

This paper contains 25 sections, 7 figures, 6 tables.

Introduction
Related Work
Multimodal Model Architectures: A Taxonomy
Type-A: Standard Cross-Attention based Deep Fusion (SCDF)
Sub-type A.1
Sub-type A.2
Type-B: Custom Layer based Deep Fusion (CLDF)
Sub-type B.1: Custom Cross-Attention Layer
Sub-type B.2: Custom Learnable Layer
Type-C: Non-Tokenized Early Fusion (NTEF)
Sub-type C.1: Linear Layer/MLP
Sub-type C.2: Q-former and Linear Layer/MLP
Sub-type C.3: Perceiver Resampler
Sub-type C.4: Custom Learnable layer
Training Methods and Data
...and 10 more sections

Figures (7)

Figure 1: Development timeline of Multimodal models grouped in four proposed architecture types.
Figure 2: Taxonomy of multimodal model architectures. Four distinct types of multimodal architectures and their sub-types are outlined. Various models are systematically catalogued under the types and sub-types. Deep Fusion: Type-A and Type-B fuse multimodal inputs within the internal layers of the model. Early Fusion: Type-C and Type-D facilitate fusion at the input stage. Type-A uses standard cross-attention, whereas Type-B utilizes custom-designed cross-attention or specialized layers. Type-C is a non-tokenizing multimodal model architecture, while Type-D employs input tokenization (discrete tokens). SCDF: Standard Cross-attention based Deep Fusion. CLDF: Custom Layer based Deep Fusion. NTEF: Non-Tokenized Early Fusion. TEF: Tokenized Early Fusion.
Figure 3: Type-A multimodal model architecture. The input modalities are deeply fused into the internal layers of the LLM using standard cross-attention layers. The cross-attention can be added either before (sub-type A.1) or after (sub-type A.2) the self-attention layer. Modality-specific encoders process the different input modalities. A resampler outputs a fixed number of modality (visual/audio/video) tokens from a variable number of input tokens.
Figure 4: Type-B multimodal model architecture. The input modalities are deeply fused into the internal layers of the LLM using custom-designed layers. Custom cross-attention layers (sub-type B.1) or other custom layers (sub-type B.2) are used for modality fusion. A Linear Layer/MLP/Q-former is used to align different modalities with the decoder layer.
Figure 5: Type-C multimodal model architecture. The (non-tokenized) input modalities are directly fed to the model at its input, rather than to its internal layers, resulting in early fusion. Different types of modules are used to connect modality encoder outputs to the LLM (model) like a Linear-Layer/MLP (sub-type C.1), Q-former and a Linear-Layer/MLP (sub-type C.2), Perceiver resampler (sub-type C.3), Custom learnable layers (sub-type C.4).
...and 2 more figures
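
The Type-C caption above describes early fusion in which modality-encoder outputs pass through a connector before entering the model at its input. A minimal sketch of sub-type C.1 (a single linear layer as the connector) follows, written in plain Python for readability. The function names, the toy dimensions, and the choice to place modality vectors before the text embeddings are assumptions made for this example, not details taken from the paper.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector; weight has shape out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def early_fuse(modality_features, text_embeddings, weight, bias):
    """Project encoder outputs into the LLM embedding space and join them with text.

    modality_features: list of encoder output vectors (continuous, not tokenized).
    text_embeddings:   list of text embedding vectors already in the LLM space.
    Returns one input sequence; the ordering (modality first) is an assumption.
    """
    projected = [linear(f, weight, bias) for f in modality_features]
    return projected + text_embeddings


if __name__ == "__main__":
    # Toy sizes: encoder dimension 3, LLM embedding dimension 2.
    weight = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
    bias = [0.0, 0.0]
    image_features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # two image patch vectors
    text_embeddings = [[0.5, 0.5]]                        # one text token embedding
    fused = early_fuse(image_features, text_embeddings, weight, bias)
    print(len(fused), fused[0])
```

The point of the sketch is that fusion happens only at the input: the model's internal layers are left unchanged and simply receive a longer sequence, which is what separates Type-C from the deep-fusion Types A and B.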
