S3: A Simple Strong Sample-effective Multimodal Dialog System

Elisei Rykov, Egor Malkershin, Alexander Panchenko

TL;DR

With the proposed data mixture, the authors demonstrate that a multimodal model built on a strong pre-trained language model and trained on only a small amount of multimodal data can perform competitively on the multimodal dialog task.

Abstract

In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, the S3 model, which achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed data mixture for training this architecture demonstrates that a multimodal model based on a strong language model, trained on a small amount of multimodal data, can perform efficiently on the multimodal dialog task.
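The architecture described in the abstract can be sketched as a simple pipeline: a frozen modality encoder produces feature vectors, a small trainable projector maps them into the language model's token-embedding space, and the projected "modality tokens" are combined with the text token embeddings before being fed to the LLM. The sketch below illustrates this flow; all dimensions, the two-layer MLP shape, and the prepend-to-text convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

D_ENC = 512    # modality-encoder feature dimension (assumed)
D_LLM = 1024   # LLM token-embedding dimension (assumed)
D_HID = 2048   # projector hidden dimension (assumed)

# Trainable projector weights: a two-layer MLP with a GELU nonlinearity,
# as one plausible reading of an "MLP modality projector".
W1 = rng.normal(0, 0.02, (D_ENC, D_HID))
W2 = rng.normal(0, 0.02, (D_HID, D_LLM))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(features):
    """Map (n_patches, D_ENC) encoder features to (n_patches, D_LLM)."""
    return gelu(features @ W1) @ W2

# Fake encoder output for one image (16 patch features, an assumption)
# and embeddings for a 10-token text prompt.
image_feats = rng.normal(size=(16, D_ENC))
text_embeds = rng.normal(size=(10, D_LLM))

# Projected modality tokens are placed alongside the text embeddings
# (here, prepended) to form the LLM input sequence.
llm_input = np.concatenate([project(image_feats), text_embeds], axis=0)
```

During training, only `W1` and `W2` (the projector) would be updated, while the encoder and the language model stay frozen, which is what makes the approach sample-efficient.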

Paper Structure

This paper contains 19 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: An example of a multimodal dialog.
  • Figure 2: Architecture of the S³ multimodal dialog system. Each modality is passed to a dedicated encoder, and the resulting modality features are passed to modality projectors, which map them to token embeddings of a large language model.
  • Figure 3: Example of JSON-formatted multimodal dialog data.
  • Figure 4: Architecture of our MLP modality projector, which maps features from the modality encoder to the language model.
  • Figure 5: A comparative analysis of the performance of S³. On the MMMU benchmark, S³ achieves a competitive score compared to various models with larger sizes and more training samples. The size of each mark corresponds to the number of parameters in the language model.
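Figure 3 refers to JSON-formatted multimodal dialog data. The paper's exact schema is not reproduced here; the snippet below is a purely hypothetical illustration of what such a format might look like, with per-turn roles, a modality type field, and a file reference for the image. Every field name and value is an assumption.

```json
[
  {"role": "user", "type": "image", "content": "images/0001.jpg"},
  {"role": "user", "type": "text", "content": "What is shown in this picture?"},
  {"role": "bot", "type": "text", "content": "A dog playing in a park."}
]
```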