MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

Omkar Thawakar; Ashmal Vayani; Salman Khan; Hisham Cholakal; Rao M. Anwer; Michael Felsberg; Tim Baldwin; Eric P. Xing; Fahad Shahbaz Khan

TL;DR

This work addresses the need for accurate yet efficient Small Language Models (SLMs) suitable for on-device deployment by introducing MobiLlama, a 0.5B-parameter transformer that shares a single FFN across all layers to substantially reduce trainable parameters and pre-training cost. Starting from a larger model design and pre-trained on 1.2T tokens from the Amber dataset, MobiLlama achieves competitive to state-of-the-art performance on nine benchmarks, and a 0.8B variant improves results further. The release is fully transparent: data pipelines, code, weights, and checkpoints are openly available. A multimodal version (MobiLlama-V) combines the SLM with a vision encoder to demonstrate visual reasoning on standard VLM benchmarks. The approach emphasizes transparency, reproducibility, and on-device efficiency, supporting broader democratization of SLM research and deployment.

Abstract

"Bigger the better" has been the predominant trend in recent Large Language Model (LLM) development. However, LLMs are not well suited to scenarios that require on-device processing, energy efficiency, a low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource-constrained devices. Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands. MobiLlama is an SLM design that starts from a larger model and applies a careful parameter sharing scheme to reduce both the pre-training and the deployment cost. Our work strives not only to bridge the gap in open-source SLMs but also to ensure full transparency: the complete training data pipeline, training code, model weights, and over 300 checkpoints, along with evaluation code, are available at: https://github.com/mbzuai-oryx/MobiLlama.

Paper Structure (12 sections, 4 figures, 9 tables)

Introduction
Related Work
Method
Baseline SLM Design
Proposed SLM Design: MobiLlama
Towards Fully Transparent MobiLlama
Results
Conclusion
Acknowledgement
Appendix
MobiLlama-Chat
Qualitative Examples

Figures (4)

Figure 1: Comparison of our MobiLlama 0.5B and 0.8B models with the recent OLMo-1.17B and TinyLlama-1.1B in terms of pre-training tokens, pre-training time and memory, model parameters, overall accuracy across nine benchmarks, and on-device efficiency (average battery consumption and average tokens/second on a PC with an RTX 2080 Ti). Our MobiLlama achieves comparable accuracy while requiring significantly less pre-training data (1.2T tokens vs. 3T tokens), less pre-training time, and less GPU memory, and it is efficient to deploy on a resource-constrained device.
Figure 2: Illustrative comparison of our MobiLlama with the two baselines. For each case, we show two transformer blocks denoted by different self-attention layers. In both baseline1 and baseline2, a dedicated MLP block comprising three FFN layers is used for each transformer layer. In contrast, our MobiLlama uses a single MLP block (highlighted by the same color) that is shared across different transformer layers. This makes it possible to increase the capacity of the network in terms of layers and hidden dimension size without any significant increase in the total number of trainable parameters.
Figure 3: Example responses from our MobiLlama across a variety of tasks, including creative storytelling, coding exercises, economic analysis, and cooking instructions. The responses highlight the model's ability to engage with both abstract concepts and practical, step-by-step processes, demonstrating its broad applicability.
Figure 4: Example responses of MobiLlama-V to visual stimuli across a range of scenarios.
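
The parameter saving described in the Figure 2 caption can be illustrated with a rough count. The sketch below is not taken from the paper's code: it assumes bias-free linear layers, a three-layer MLP block shaped like a Llama-style gated FFN (two hidden-to-intermediate projections and one intermediate-to-hidden projection), and four hidden-by-hidden attention projections, and it ignores embeddings and normalization layers. The names `BlockConfig`, `dedicated_total`, and `shared_total`, as well as the example sizes, are hypothetical and are not MobiLlama's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockConfig:
    """Illustrative transformer sizes (hypothetical, not the paper's values)."""
    hidden: int        # model width
    intermediate: int  # inner width of the MLP block
    layers: int        # number of transformer layers


def attention_params(cfg: BlockConfig) -> int:
    # Assumed: Q, K, V and output projections, each hidden x hidden, no biases.
    return 4 * cfg.hidden * cfg.hidden


def mlp_params(cfg: BlockConfig) -> int:
    # Assumed: three bias-free FFN layers, each of size hidden x intermediate.
    return 3 * cfg.hidden * cfg.intermediate


def dedicated_total(cfg: BlockConfig) -> int:
    # Baseline layout: every layer owns its attention and its own MLP block.
    return cfg.layers * (attention_params(cfg) + mlp_params(cfg))


def shared_total(cfg: BlockConfig) -> int:
    # Shared layout: every layer owns its attention; one MLP block is reused.
    return cfg.layers * attention_params(cfg) + mlp_params(cfg)


if __name__ == "__main__":
    cfg = BlockConfig(hidden=1024, intermediate=4096, layers=8)
    saved = dedicated_total(cfg) - shared_total(cfg)
    print(f"dedicated MLP per layer: {dedicated_total(cfg):,} parameters")
    print(f"single shared MLP:       {shared_total(cfg):,} parameters")
    print(f"saved by sharing:        {saved:,} parameters")
```

Under these assumptions the saving is always `(layers - 1) * mlp_params(cfg)`, so it grows with depth. That is why sharing the MLP block leaves room to add layers or widen the hidden dimension at a similar parameter budget.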
