Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang
TL;DR
The paper addresses the high operational costs of large multimodal vision-language models by proposing Xmodel-VLM, a 1.1B-parameter baseline that achieves competitive performance with much lower compute requirements. It combines a CLIP ViT-L/14 visual encoder with Xmodel-LM-1.1B, a language model trained from scratch, and a lightweight XDP projector, and trains them in a two-stage workflow inspired by LLaVA: pretraining on filtered CC3M data, followed by instruction-following fine-tuning. Experiments and targeted ablations show that the approach yields strong multimodal performance and faster inference on consumer hardware, while revealing trade-offs between projector design, token count, and LM size. The work provides a cost-effective, open-source baseline that can accelerate deployment and iteration of multimodal systems in practical settings.
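To make the token-count trade-off concrete, the following minimal sketch computes how many visual tokens a patch-based encoder hands to the language model, and how a downsampling projector would shrink that number. Only the patch size of 14 comes from the ViT-L/14 encoder named above; the 336-pixel input resolution, the stride-2 downsampling, and the function name `visual_token_count` are illustrative assumptions, not details confirmed by this summary.

```python
def visual_token_count(image_size: int, patch_size: int, downsample_stride: int = 1) -> int:
    """Count the patch tokens passed to the language model (class token excluded).

    patch_size=14 matches a ViT-L/14 encoder. image_size and
    downsample_stride are hypothetical values chosen for illustration.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size  # patches along one side of the image
    if grid % downsample_stride != 0:
        raise ValueError("patch grid must be divisible by downsample_stride")
    reduced = grid // downsample_stride  # grid side after projector downsampling
    return reduced * reduced


# Assumed 336x336 input: a 24x24 patch grid.
full_tokens = visual_token_count(336, 14)
# Assumed stride-2 downsampling in the projector: a 12x12 grid.
reduced_tokens = visual_token_count(336, 14, downsample_stride=2)
print(full_tokens, reduced_tokens)
```

Under these assumptions a stride-2 projector cuts the visual sequence to a quarter of its length, which is the kind of reduction that lowers inference cost at some risk to fine-grained visual detail.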
Abstract
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
