Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, Jian Tang

TL;DR

The paper tackles the high computational burden of Multimodal Large Language Models by proposing Multimodal Small Language Models (MSLMs) and introducing Mipha, a 3B-parameter model that rivals larger peers without extra data. It empirically analyzes design choices across visual backbones, small LMs, and optimization methods, finding that SigLIP with a Phi-2 back-end and full cross-modal fine-tuning yields strong results, while LoRA provides a data- and compute-efficient alternative. Mipha-3B delivers competitive or superior performance across multiple VQA and instruction-following benchmarks, often surpassing 7–13B MLLMs while using far less training data. The work offers practical guidelines for building strong MSLMs that approach MLLM capabilities and provides the codebase for replication.

Abstract

Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/llava-phi.

Paper Structure (22 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Evaluation Benchmarks Overview. We evaluate our model variants on academic-task-oriented benchmarks (left) as well as instruction-following benchmarks (middle). The answer is predicted by our proposed efficient MSLM: Mipha-3B. Additionally, we explore three key design spaces of MSLMs: 1) visual representation, 2) language model, and 3) optimization strategy (right).
  • Figure 2: Selection of Small Language Models. We evaluate four open-source SLMs on 8 benchmarks, and the Phi family shows the best performance (left & middle). We find that an MSLM equipped with Phi-2-2.7B can identify that the monkey in the image is performing, a subtlety that other models fail to recognize (right).
  • Figure 3: Base vs. Instruct-tuned LMs. For MLLMs, we explore the differences between using a base LM and an instruct-tuned LM. While the quantitative performance metrics appear similar, the qualitative results reveal differences (left). For example, when the responses generated by MLLMs equipped with either a base or SFT LM are compared to those from models finetuned with RLHF or Safe-RLHF, the latter are observed to be more verbalized (right).
  • Figure 4: Choosing a Pretrained Vision Representation & Scaling Image Resolution. We evaluate various visual backbones, such as CLIP, SigLIP, DINOv2, and ViT-IN21K (left). We analyze model performance in relation to increasing image resolution (middle). We provide qualitative examples from multimodal models employing visual backbones at different image resolutions (right).
  • Figure 5: Frozen vs. Finetuning & Full-parameter tuning vs. LoRA. We study the impact of optimization strategies on the performance of small multimodal models. Specifically, we explore how finetuning or freezing the visual representation backbone and the LM during the instruction tuning phase affects model performance (left). Additionally, we confirm that, compared to full-parameter tuning, applying LoRA to MSLMs is equally effective and can significantly alleviate the training burden (right).
  • ...and 4 more figures
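
The full-parameter vs. LoRA comparison in Figure 5 can be made concrete with a quick count of trainable weights for a single linear layer. The following is a minimal sketch, not the authors' code: the layer width and the LoRA rank are hypothetical values chosen for illustration, and the helper names are invented here. It relies only on the standard LoRA formulation, in which a frozen weight matrix W of shape d_out x d_in receives a low-rank update B @ A, with A of shape rank x d_in and B of shape d_out x rank.

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA update B @ A (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical sizes for illustration only; they are not taken from the paper.
D_IN, D_OUT, RANK = 2560, 2560, 64

full = full_trainable_params(D_IN, D_OUT)
lora = lora_trainable_params(D_IN, D_OUT, RANK)

# Report how many weights each strategy trains for this one layer.
print(f"full-parameter: {full:,} trainable weights")
print(f"LoRA (rank {RANK}): {lora:,} trainable weights")
print(f"LoRA / full ratio: {lora / full:.3f}")
```

Under these assumed sizes, LoRA trains a small fraction of the weights that full-parameter tuning would update for the same layer, which is the sense in which it can alleviate the training burden. The actual savings depend on the rank and on which layers are adapted.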