Empower Vision Applications with LoRA LMM

Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu

TL;DR

VaLoRA addresses the latency-accuracy trade-off in vision applications by deploying LoRA adapters within Large Multimodal Models through an end-to-end system. It introduces an accuracy-aware adapter generation pipeline, a high-efficiency Adaptive-Tiling Matrix Multiplication (ATMM) batching operator, and a flexible adapter orchestration mechanism including swift mode switching and deLoRA mixture inference. The approach yields 24-62% accuracy gains and 20-89% end-to-end latency reductions across five tasks and three LMMs, outperforming state-of-the-art LoRA-serving systems. This work demonstrates that a carefully designed LoRA LMM serving stack can enable accurate, scalable, and low-latency vision applications with a single foundation model.

Abstract

Large Multimodal Models (LMMs) have shown significant progress in various complex vision tasks with the solid linguistic and reasoning capacity inherited from large language models (LLMs). Low-rank adaptation (LoRA) offers a promising method to integrate external knowledge into LMMs, compensating for their limitations on domain-specific tasks. However, existing LoRA model serving is computationally expensive and incurs extremely high latency. In this paper, we present an end-to-end solution that empowers diverse vision tasks and enriches vision applications with LoRA LMMs. Our system, VaLoRA, enables accurate and efficient vision tasks through 1) an accuracy-aware LoRA adapter generation approach that generates LoRA adapters rich in domain-specific knowledge to meet application-specific accuracy requirements, 2) an adaptive-tiling LoRA adapter batching operator that efficiently computes concurrent heterogeneous LoRA adapters, and 3) a flexible LoRA adapter orchestration mechanism that manages application requests and LoRA adapters to achieve the lowest average response latency. We prototype VaLoRA on five popular vision tasks on three LMMs. Experimental results show that VaLoRA improves accuracy by 24-62% compared to the original LMMs and reduces latency by 20-89% compared to state-of-the-art LoRA model serving systems.

Paper Structure

This paper contains 32 sections, 1 equation, 26 figures, 3 tables, 2 algorithms.

Figures (26)

  • Figure 1: Illustration of LMM inference. Qwen-VL-7B [QwenVL] generates the right action recognition answer to a piece of data from the UCF-101 dataset [soomro2012ucf101] and the corresponding prompt.
  • Figure 2: LoRA model inference. (a) Unmerge mode supports computing multiple different LoRA adapters in a batch. $A_1$ and $B_1$ constitute LoRA adapter #1. (b) Merge mode supports no-extra-delay inference but only one adapter at once.
  • Figure 3: The potential of LMM. (a) For zero-shot grounding of airplanes in remote-sensing views, the LMM Qwen-VL delivers, in general, 67.2% accuracy vs. 18.3% for YOLO [glenn2021YOLOV5]. (b) In VQA, Qwen-VL yields 78.8% accuracy vs. 73.3% for OSCAR [li2020oscar].
  • Figure 3: Scales to multiple GPUs.
  • Figure 4: LoRA adapters with domain-specific knowledge improve the Qwen-VL's accuracy on target tasks.
  • ...and 21 more figures
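
The two inference modes named in the Figure 2 caption can be illustrated with a small sketch. This is not VaLoRA's implementation or its ATMM operator; it is a minimal pure-Python example under stated assumptions: the function names, the tiny matrix sizes, and the two rank-1 adapters are all hypothetical, and the usual LoRA scaling factor is omitted for brevity. It only shows why merge mode needs one matrix multiplication per request but serves a single adapter, while unmerge mode shares the base computation across a batch that mixes adapters.

```python
# Illustrative sketch only: names, shapes, and values are assumptions,
# not taken from the VaLoRA paper or any library.

def matmul(X, Y):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Hypothetical base weight W with 3 inputs and 2 outputs.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two hypothetical rank-1 adapters: A_i is 3x1, B_i is 1x2.
adapters = {
    1: ([[1.0], [0.0], [0.0]], [[0.5, 0.0]]),
    2: ([[0.0], [1.0], [0.0]], [[0.0, 2.0]]),
}

def merged_forward(x, adapter_id):
    """Merge mode: fold A*B into W, then run one matmul (one adapter only)."""
    A, B = adapters[adapter_id]
    W_merged = add(W, matmul(A, B))
    return matmul(x, W_merged)

def unmerged_forward(batch, adapter_ids):
    """Unmerge mode: shared base matmul plus a per-request low-rank branch."""
    base = matmul(batch, W)  # computed once for the whole batch
    out = []
    for row, base_row, aid in zip(batch, base, adapter_ids):
        A, B = adapters[aid]
        delta = matmul(matmul([row], A), B)[0]  # (x A) B, the extra work
        out.append([b + d for b, d in zip(base_row, delta)])
    return out

batch = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
mixed = unmerged_forward(batch, [1, 2])  # each request uses a different adapter
```

In this sketch, each row of `mixed` matches what `merged_forward` returns for the same input and adapter, so the two modes agree numerically and differ only in where the adapter cost is paid.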