MammothModa: Multi-Modal Large Language Model

Qi She, Junwen Pan, Xin Wan, Rui Zhang, Dawei Lu, Kai Huang

TL;DR

MammothModa addresses the challenge of combining high-resolution and long-duration visual input with strong language understanding in multimodal LLMs. Its design has three parts: (i) Visual Attention Experts integrated into the LLM to handle visual tokens while preserving language performance, (ii) a Visual Merger and a Frame Position ID scheme that use the context window efficiently for high-resolution images and long videos, and (iii) a carefully curated bilingual multimodal dataset to reduce visual hallucinations. The model is trained in three phases: Vision-Language Alignment, Multi-Task Pretraining, and Supervised Fine-Tuning. The authors report improved efficiency and results that outperform state-of-the-art models such as the LLaVA series on the main real-world visual language benchmarks, supported by qualitative examples.

Abstract

In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabilities while Maintaining Complex Language Understanding: In addition to the vision encoder, we incorporate Visual Attention Experts into the LLM to enhance its visual capabilities. (ii) Extending the Context Window for High-Resolution and Long-Duration Visual Features: We use the Visual Merger Module to reduce the number of tokens for high-resolution images and incorporate frame position IDs to avoid position interpolation. (iii) High-Quality Bilingual Datasets: We meticulously curate and filter a high-quality bilingual multimodal dataset to reduce visual hallucinations. With the above recipe, we build MammothModa, which consistently outperforms state-of-the-art models, e.g., the LLaVA series, across the main real-world visual language benchmarks without bells and whistles.
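To make design insight (ii) more concrete, the following is a minimal sketch of the two ideas, not the authors' implementation. The abstract does not specify how the Visual Merger combines tokens or how frame position IDs are assigned, so this sketch assumes average pooling over non-overlapping windows and one shared position ID per video frame. The function names `merge_tokens` and `frame_position_ids` are hypothetical.

```python
from typing import List

Grid = List[List[List[float]]]  # H x W grid of feature vectors


def merge_tokens(grid: Grid, window: int = 2) -> Grid:
    """Average-pool non-overlapping window x window blocks of visual tokens.

    Assumption: the merger is modeled as plain average pooling; the real
    module may differ. Token count drops by a factor of window ** 2.
    """
    h, w = len(grid), len(grid[0])
    if h % window or w % window:
        raise ValueError("grid height and width must be divisible by window")
    dim = len(grid[0][0])
    merged: Grid = []
    for i in range(0, h, window):
        row = []
        for j in range(0, w, window):
            block = [grid[i + di][j + dj]
                     for di in range(window) for dj in range(window)]
            row.append([sum(v[k] for v in block) / len(block)
                        for k in range(dim)])
        merged.append(row)
    return merged


def frame_position_ids(num_text_tokens: int, num_frames: int,
                       tokens_per_frame: int) -> List[int]:
    """Assign consecutive IDs to text tokens and one shared ID per frame.

    Assumption: sharing an ID within a frame is one way to keep the largest
    position ID small, so long videos stay inside the trained position range
    without interpolating position embeddings.
    """
    ids = list(range(num_text_tokens))
    next_id = num_text_tokens
    for _ in range(num_frames):
        ids.extend([next_id] * tokens_per_frame)
        next_id += 1
    return ids


# Usage: a 4 x 4 grid of 1-dimensional features becomes a 2 x 2 grid.
grid = [[[float(i * 4 + j)] for j in range(4)] for i in range(4)]
merged = merge_tokens(grid, window=2)

# Usage: 2 text tokens followed by 3 frames of 4 tokens each.
ids = frame_position_ids(num_text_tokens=2, num_frames=3, tokens_per_frame=4)
```

Under these assumptions, 14 tokens occupy only 5 distinct position IDs, whereas a purely sequential numbering would need 14.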

Paper Structure (23 sections, 5 tables)