MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal
Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan
TL;DR
MMMModal is a multimodal large language model that processes multiple images, multiple audio segments, and multi-turn dialogue by aligning SigLIP visual features and Whisper audio features with an LLM. The authors build large synthetic datasets in the Malaysian context to train and evaluate multi-input, multi-turn capabilities, covering visual-only, audio-only, and cross-modal data, and they introduce dedicated input tokens to integrate the modalities. Training proceeds in two stages: visual and audio feature alignment first, then instruction finetuning on synthetic multi-turn data. The resulting model can handle conversations that mix several images and audio clips. The authors also report architectural and training insights, including mitigations for distributed training challenges, and release their data and code to support reproducibility and further research. The work targets practical multimodal reasoning in multilingual settings and lays groundwork for future extensions to video and longer contexts.
Abstract
We introduce a multimodal large language model designed to comprehend multiple images, multiple audio inputs, and combinations of both within a single multi-turn session. We use the SigLIP encoder for visual inputs and the Whisper encoder for audio inputs. The model is bilingual and understands both English and Malay. We release two versions: one based on TinyLlama with 1.1B parameters and one based on Mistral with 7B parameters. Because it handles multiple modalities and both languages, the model is a significant step forward for the Malaysian context and beyond. All models are released at https://huggingface.co/collections/mesolitica/multimodal-malaysian-llm-65c6f893e03f78fa9e5c8859
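
To make the idea of multiple images and audio clips inside one multi-turn session concrete, the following is a minimal sketch of how placeholder tokens in a conversation could be expanded into per-modality feature slots before the sequence reaches the LLM. It is an illustration under stated assumptions, not the released implementation: the placeholder strings `<image>` and `<audio>`, the function names, the whitespace "tokenizer", and the feature lengths are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical placeholder strings; the model's actual special tokens may differ.
IMAGE_TOKEN = "<image>"
AUDIO_TOKEN = "<audio>"

Slot = Tuple[str, object]  # ("text", word) | ("image", index) | ("audio", index)


def split_on_placeholders(prompt: str) -> List[Slot]:
    """Split a prompt into text spans and numbered image/audio placeholders."""
    pattern = re.compile(f"({re.escape(IMAGE_TOKEN)}|{re.escape(AUDIO_TOKEN)})")
    counters = {"image": 0, "audio": 0}
    segments: List[Slot] = []
    for piece in pattern.split(prompt):
        if piece == IMAGE_TOKEN:
            segments.append(("image", counters["image"]))
            counters["image"] += 1
        elif piece == AUDIO_TOKEN:
            segments.append(("audio", counters["audio"]))
            counters["audio"] += 1
        elif piece:
            segments.append(("text", piece))
    return segments


def build_layout(prompt: str, image_lens: List[int], audio_lens: List[int]) -> List[Slot]:
    """Expand each placeholder into as many slots as its encoder yields features.

    image_lens[i] / audio_lens[j] stand in for the number of projected feature
    vectors produced for the i-th image / j-th audio clip (assumed values).
    Text is split on whitespace as a stand-in for a real tokenizer.
    """
    segments = split_on_placeholders(prompt)
    n_images = sum(1 for kind, _ in segments if kind == "image")
    n_audios = sum(1 for kind, _ in segments if kind == "audio")
    if n_images != len(image_lens) or n_audios != len(audio_lens):
        raise ValueError("placeholder count does not match the inputs provided")

    layout: List[Slot] = []
    for kind, value in segments:
        if kind == "text":
            layout.extend(("text", word) for word in value.split())
        elif kind == "image":
            layout.extend(("image", value) for _ in range(image_lens[value]))
        else:
            layout.extend(("audio", value) for _ in range(audio_lens[value]))
    return layout


# A two-turn session with two images in the first turn and one clip in the second.
session = (
    "User: <image> <image> compare them. "
    "Assistant: ok. "
    "User: <audio> what is said?"
)
layout = build_layout(session, image_lens=[2, 3], audio_lens=[4])
```

In a real model, each `("image", i)` or `("audio", j)` slot would be filled with a projected encoder feature vector and each text slot with a token embedding, so that the LLM sees one interleaved sequence across all turns.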
