A Review of Multi-Modal Large Language and Vision Models

Kilian Carolan, Laura Fennelly, Alan F. Smeaton

TL;DR

This survey traces the evolution from traditional language models to multimodal large language models (MM-LLMs), highlighting transformer architectures and attention mechanisms as foundational. It analyzes major text LLMs (GPT, Claude, Gemini, LLaMA, Falcon, Grok) and major vision and vision-language models (ViT, CLIP, BLIP-2), detailing how MM-LLMs are built through end-to-end or staged training and how tuning methods (full fine-tuning, PEFT, RLHF, prompt engineering) shape performance. The article also weighs open-source against proprietary models, discusses ethical concerns and data governance, and reviews evaluation benchmarks and hallucination mitigation strategies. Overall, it offers a practical synthesis of approaches for deploying MM-LLMs across domains, emphasizing data quality, cost considerations, and responsible research practices.

Abstract

Large Language Models (LLMs) have recently emerged as a focal point of research and application, driven by their unprecedented ability to understand and generate text with human-like quality. Even more recently, LLMs have been extended into multi-modal large language models (MM-LLMs), which extend their capabilities to handle image, video and audio information in addition to text. This opens up applications such as text-to-video generation, image captioning and text-to-speech, and is achieved either by retrofitting an LLM with multi-modal capabilities or by building an MM-LLM from scratch. This paper provides an extensive review of the current state of LLMs with multi-modal capabilities as well as the very recent MM-LLMs. It covers the historical development of LLMs, especially the advances enabled by transformer-based architectures such as OpenAI's GPT series and Google's BERT, as well as the role of attention mechanisms in enhancing model performance. The paper covers the most important LLMs and MM-LLMs, along with the techniques of model tuning, including fine-tuning and prompt engineering, which tailor pre-trained models to specific tasks or domains. Ethical considerations and challenges, such as data bias and model misuse, are also analysed to underscore the importance of responsible AI development and deployment. Finally, we discuss the implications of open-source versus proprietary models in AI research. Through this review, we provide insights into the transformative potential of MM-LLMs in various applications.

Paper Structure (37 sections, 2 figures, 2 tables)

This paper contains 37 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: A summary of how an input sequence is decomposed into query, key, and value vectors across the various attention mechanisms, taken from ainslie2023gqa.
  • Figure 2: A comparative summary of the different training methods used for the reviewed MM-LLMs, all of which follow a two-stage training process (taken from ye2023mplugowl).