TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models

Junlong Jia, Ying Hu, Xi Weng, Yiming Shi, Miao Li, Xingjian Zhang, Baichuan Zhou, Ziyu Liu, Jie Luo, Lei Huang, Ji Wu

TL;DR

This work addresses the complexity of designing and training small-scale large multimodal models by introducing TinyLLaVA Factory, a modular, factory-pattern codebase with interchangeable data, model, training recipe, trainer, and evaluator components. Built on PyTorch and Hugging Face and compatible with DeepSpeed, it provides standardized data pipelines and ready-to-use training recipes to enable pretraining and finetuning across small LMMs from 450M to 2.7B parameters. Through reproducing multiple TinyLLaVA variants and evaluating on eight benchmarks, the approach demonstrates reproducibility and slight performance gains over the original, while maintaining accessibility and low compute requirements. By lowering the barrier to research in small-scale LMMs and offering a community-friendly, extensible framework, this work has practical impact for researchers and practitioners exploring affordable multimodal AI systems.

Abstract

We present TinyLLaVA Factory, an open-source modular codebase for small-scale large multimodal models (LMMs) that focuses on simple code implementation, easy extension with new features, and reproducible training results. Following the factory pattern in software engineering, TinyLLaVA Factory modularizes the entire system into interchangeable components. Each component integrates a suite of cutting-edge models and methods while leaving room for further extensions. In addition to letting users customize their own LMMs, TinyLLaVA Factory provides popular training recipes so that users can pretrain and finetune their models with less coding effort. Empirical experiments validate the effectiveness of our codebase. The goal of TinyLLaVA Factory is to assist researchers and practitioners in exploring the wide landscape of designing and training small-scale LMMs with affordable computational resources.
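
To make the factory-pattern idea concrete, the following is a minimal Python sketch of how interchangeable components could be registered by name and assembled from a configuration. It is an illustration only and not the TinyLLaVA Factory API: the names `register`, `build`, `build_from_config`, `LinearConnector`, and `MLPConnector` are hypothetical and do not come from the codebase.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a component kind (e.g. "connector") and a name
# to a builder. None of these identifiers are taken from TinyLLaVA Factory.
_REGISTRY: Dict[str, Dict[str, Callable[..., object]]] = {}


def register(kind: str, name: str):
    """Decorator that records a builder under (kind, name)."""
    def decorator(builder):
        _REGISTRY.setdefault(kind, {})[name] = builder
        return builder
    return decorator


def build(kind: str, name: str, **kwargs):
    """Look up a registered builder and instantiate it with kwargs."""
    try:
        builder = _REGISTRY[kind][name]
    except KeyError:
        raise ValueError(f"unknown {kind!r} component: {name!r}") from None
    return builder(**kwargs)


@register("connector", "linear")
class LinearConnector:
    # Placeholder class: stores only the shapes, no real layers.
    def __init__(self, in_dim: int, out_dim: int):
        self.in_dim, self.out_dim = in_dim, out_dim


@register("connector", "mlp")
class MLPConnector:
    # Placeholder class: hidden_dim falls back to out_dim when not given.
    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 0):
        self.in_dim, self.out_dim = in_dim, out_dim
        self.hidden_dim = hidden_dim or out_dim


def build_from_config(config: Dict[str, Dict[str, object]]) -> Dict[str, object]:
    """Assemble one component per kind from a {kind: {"name": ..., ...}} dict."""
    return {kind: build(kind, **spec) for kind, spec in config.items()}


if __name__ == "__main__":
    parts = build_from_config(
        {"connector": {"name": "mlp", "in_dim": 8, "out_dim": 16}}
    )
    print(type(parts["connector"]).__name__)
```

In a design of this kind, swapping one component for another is a one-line change to the configuration, and adding a new variant only requires defining a class and registering it under a new name.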

Paper Structure (12 sections, 1 figure, 2 tables)