GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Xudong Lu, Yinghao Chen, Renshou Wu, Haohao Gao, Xi Chen, Xue Yang, Xiangyu Zhao, Aojun Zhou, Fangyuan Li, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li

TL;DR

GenieBlue tackles the challenge of giving LLMs on mobile devices both strong language and strong multimodal capabilities, without relying on MoE architectures and without degrading the original LLM's language skills. It freezes the base LLM, fully trains replicated copies of every $n$th transformer block, and adds lightweight LoRA modules to the remaining blocks, forming a separate MLLM path. This non-shared base deployment uses the untouched LLM for pure-language tasks and the trained path for multimodal inference. Empirical results show that GenieBlue achieves multimodal performance close to that of fully fine-tuned MLLMs (over 97% retention) with no loss in LLM performance under the non-shared deployment, and that it runs on smartphone NPUs with practical latency. The work offers a hardware-conscious blueprint for edge-ready, dual-capability LLMs that reduces memory and deployment constraints while maintaining user-facing language quality.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.

Paper Structure

This paper contains 23 sections, 3 figures, 12 tables.

Figures (3)

  • Figure 1: CogVLM [wang2023cogvlm] replicates an identical visual expert module alongside each transformer block to handle multimodal inputs.
  • Figure 2: Overview of GenieBlue. We replicate the transformer blocks at every quarter interval of the layers in the LLM and incorporate LoRA modules into the other transformer blocks. During multimodal training, we freeze the original LLM while fully training the replicated transformer blocks and the added LoRA parameters. For pure-text inference, we utilize the original LLM. For multimodal inference, we replace the original blocks with the trained transformer blocks at every quarter interval and add LoRA to the remaining transformer blocks. This non-shared base approach avoids the MoE structure while decoupling the inference processes of the LLM and MLLM.
  • Figure 3: Structure detail of GenieBlue during the MLLM inference process.
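
The block layout described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the names `Block`, `replicated_indices`, `text_path`, and `multimodal_path` are hypothetical, and the choice of the first block of each quarter as the replicated one is an assumption, since the caption says only "every quarter interval".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for one transformer block in an inference path."""
    index: int
    kind: str  # "original", "replicated", or "original+lora"


def replicated_indices(num_layers: int, num_replicas: int = 4) -> List[int]:
    """Pick one block per equal-sized interval of the layer stack.

    Assumption: the first block of each interval is the one replicated.
    """
    if num_replicas <= 0 or num_layers % num_replicas != 0:
        raise ValueError("num_layers must be a positive multiple of num_replicas")
    stride = num_layers // num_replicas
    return list(range(0, num_layers, stride))


def text_path(num_layers: int) -> List[Block]:
    """Pure-text inference: every block is the frozen, original LLM block."""
    return [Block(i, "original") for i in range(num_layers)]


def multimodal_path(num_layers: int, num_replicas: int = 4) -> List[Block]:
    """Multimodal inference: trained replicas at the chosen indices,
    frozen original blocks with LoRA added everywhere else."""
    chosen = set(replicated_indices(num_layers, num_replicas))
    return [
        Block(i, "replicated" if i in chosen else "original+lora")
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    for block in multimodal_path(8):
        print(block.index, block.kind)
```

Because the text path never touches the replicated blocks or the LoRA parameters, the original LLM's outputs are unchanged by multimodal training, which is the property the non-shared base approach relies on.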