Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique

TL;DR

This survey analyzes the rapid evolution of transformer-based large language models (LLMs) and the burgeoning field of multimodal LLMs (MLLMs). It categorizes LLMs into encoder-only, decoder-only, and encoder-decoder families, and surveys pre-training and fine-tuning techniques, including parameter-efficient methods and mixture-of-experts approaches. The paper provides a comprehensive benchmarking panorama across language and multimodal tasks (e.g., MMLU, SuperGLUE, NLVR2, VQA), discusses data quality and bias, model compression, and distributed computation, and reviews leading models (e.g., GPT, PaLM, LLaMA, Gopher, PaLM-E, KOSMOS-1) and their multimodal extensions. Overall, it highlights trends toward efficiency, scalability, and safety, and outlines practical directions for robust, scalable, and trustworthy LLMs in diverse domains.

Abstract

Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural networks, often encompassing dozens of neural network layers and containing billions to trillions of parameters. They are typically trained on vast datasets, utilizing architectures based on transformer blocks. Present-day LLMs are multi-functional, capable of performing a range of tasks from text generation and language translation to question answering, as well as code generation and analysis. An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities, including images, audio, and video. This enhancement empowers MLLMs with capabilities like video editing, image comprehension, and captioning for visual content. This survey provides a comprehensive overview of the recent advancements in LLMs. We begin by tracing the evolution of LLMs and subsequently delve into the advent and nuances of MLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical features, strengths, and limitations. Additionally, we present a comparative analysis of these models and discuss their challenges, potential limitations, and prospects for future development.

Paper Structure

This paper contains 86 sections, 6 equations, 30 figures, 14 tables.

Figures (30)

  • Figure 1: Structure of the paper, showing the organization of its sections: introduction, background and comparison, state-of-the-art LLMs and methodologies, challenges, and conclusion.
  • Figure 2: The architecture of the Transformer model, which includes an encoder-decoder structure. Key components such as multi-head attention, positional encoding, and residual connections facilitate efficient learning and performance in tasks such as natural language processing and machine translation.
  • Figure 3: (A) Workflow of an auto-encoder, which encodes the feature attributes directly. (B) Workflow of a Variational Auto-encoder (VAE). Unlike an auto-encoder, a VAE encodes the feature distribution and reconstructs the image from a sample of that distribution, which gives VAEs the ability to generate new images.
  • Figure 4: Basic architecture of a Generative Adversarial Network (GAN). The generator creates synthetic images from a latent space, while the discriminator distinguishes between real images from the dataset and generated images. The generator is trained to maximize the discriminator’s error, while the discriminator is trained to minimize its error in distinguishing real from fake images, leading to adversarial learning between the two components.
  • Figure 5: Performance trajectory of various LLMs on the MMLU benchmark, shown as average accuracy percentages. The 'Random Baseline' represents a lower bound of performance, while the highlighted teal line traces the models with the highest average scores each year, culminating with the introduction of models like GPT-4o and Gemini Ultra that set new benchmarks for language understanding.
  • ...and 25 more figures