Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática

Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva

TL;DR

This chapter presents the fundamentals of MLLMs and emblematic models, discusses open challenges, and highlights promising trends.

Abstract

Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception in modalities such as image and audio, a key advance in contemporary AI. This chapter presents the fundamentals of MLLMs and emblematic models. It also explores practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph. Supplementary material for further hands-on study is publicly available at https://github.com/neemiasbsilva/MLLMs-Teoria-e-Pratica. Finally, the chapter discusses open challenges and highlights promising trends.

Paper Structure (31 sections, 16 figures)

Figures (16)

  • Figure 1.1: Overview of an MLLM architecture: inputs from different modalities are encoded and aligned with the textual input so that an LLM can process them jointly. The MLLM output can be textual or multimodal (via a dedicated generator). Figure adapted from yin2024survey.
  • Figure 1.2: Simplified timeline illustrating the evolution of some emblematic MLLMs, showing the diversification between open-source and closed-source (not publicly available) models. Figure adapted from yin2024survey.
  • Figure 1.3: Overview of how an LLM works. Figure adapted from transf.
  • Figure 1.4: Overview of CLIP. Figure adapted from radford2021learning.
  • Figure 1.5: Process of building visual (a) and textual (b) embeddings in multimodal models. Figure adapted from raschka2024.
  • ...and 11 more figures