EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction

Hsi-Che Lin, Yu-Chu Yu, Kai-Po Chang, Yu-Chiang Frank Wang

TL;DR

EMLoC addresses the memory bottleneck in fine-tuning large foundation models by training on a downstream-aware lightweight emulator constructed via activation-aware SVD, enabling LoRA-based fine-tuning under inference memory budgets. It then applies a LoRA correction to compensate for misalignment between the emulator and the full model, allowing the learned updates to transfer effectively during inference. Across seven VQA benchmarks and NLP tasks, EMLoC outperforms strong baselines and scales to 38B-parameter models on consumer GPUs without quantization, closely approaching full-model fine-tuning performance. This approach democratizes customization of large models for domain adaptation and personalization while maintaining practical resource requirements. In formal terms, the emulator replaces weight matrices $W$ with a low-rank approximation $W^{\mathcal{E}}$ (constructed by SVD-LLM) and trains LoRA modules $\Lambda$ on the emulator; the final inference uses the original weights with compensated LoRA, $W + \Lambda^{c}$, ensuring consistency via the condition $x^\top(W + \Lambda^{c}) = x^\top(W^{\mathcal{E}} + \Lambda)$ for active inputs $x \in \mathcal{V}_\Lambda$. The method is validated through extensive experiments and ablations, showing significant memory savings with minimal performance loss and robust transfer to larger models and generative personalization tasks.
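The consistency condition above can be checked numerically on a toy layer. The following is a minimal sketch under stated assumptions, not the authors' correction algorithm: it uses a plain truncated SVD in place of SVD-LLM, takes $\mathcal{V}_\Lambda$ to be the column span of the LoRA input factor `A` (so that inputs orthogonal to it produce no LoRA output), and builds $\Lambda^{c}$ by projecting the weight gap $W^{\mathcal{E}} - W$ onto that subspace. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, k, r = 32, 24, 8, 4  # hypothetical layer sizes and ranks

# Original weight W, applied as y = x^T W (rows of W index input features).
W = rng.standard_normal((d_in, d_out))

# Emulator weight W_E: plain rank-k truncated SVD, a stand-in for SVD-LLM.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_E = (U[:, :k] * s[:k]) @ Vt[:k]

# LoRA update as it might look after training on the emulator: Lambda = A @ B.
A = rng.standard_normal((d_in, r))
B = rng.standard_normal((r, d_out))
Lam = A @ B

# Assumption: V_Lambda is the column span of A. Q is an orthonormal basis of it.
Q, _ = np.linalg.qr(A)  # reduced QR, Q has shape (d_in, r)

# Compensated LoRA: add the weight gap restricted to V_Lambda.
# Q @ Q.T is the orthogonal projector onto V_Lambda, so Lam_c keeps rank <= r.
Lam_c = Lam + Q @ (Q.T @ (W_E - W))

# An input inside V_Lambda sees the same output from both models.
x = A @ rng.standard_normal(r)
lhs = x @ (W + Lam_c)   # original weights + compensated LoRA
rhs = x @ (W_E + Lam)   # emulator weights + trained LoRA

# An input orthogonal to V_Lambda is unaffected by the compensated LoRA.
z = rng.standard_normal(d_in)
x_perp = z - Q @ (Q.T @ z)
```

In this construction the correction never leaves the subspace the LoRA module already acts on, which is why the compensated update can still be stored and merged as a rank-$r$ module. The paper's algorithm additionally involves a hyperparameter $\lambda$ (see Figure 4), which this sketch omits.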

Abstract

Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains. However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference. We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference. EMLoC constructs a task-specific lightweight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set. Fine-tuning is then performed on this lightweight emulator via LoRA. To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which can then be merged into the original model for inference. EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications. Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities. Moreover, without quantization, EMLoC enables fine-tuning of a 38B model, which originally required 95GB of memory, on a single 24GB consumer GPU, bringing efficient and practical model adaptation to individual users.
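To illustrate what "activation-aware" means for the emulator, the sketch below compares a plain truncated SVD of a weight matrix with a whitening-based variant that minimizes the output error $\lVert XW - XW' \rVert_F$ on calibration activations $X$. It follows the general Cholesky-whitening idea used by activation-aware SVD methods and is an assumption-laden illustration, not the paper's or SVD-LLM's implementation. The function names, sizes, the synthetic calibration data, and the `eps` regularizer are all hypothetical.

```python
import numpy as np

def truncated_svd(M, k):
    """Best rank-k approximation of M in Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def activation_aware_lowrank(W, X, k, eps=1e-6):
    """Rank-k W' that (up to eps) minimizes ||X W - X W'||_F.

    Uses G = X^T X = L L^T, so ||X W||_F = ||L^T W||_F; truncating
    L^T W and mapping back with L^{-T} gives the weighted optimum.
    """
    G = X.T @ X + eps * np.eye(X.shape[1])  # regularized Gram matrix
    L = np.linalg.cholesky(G)               # lower triangular, G = L @ L.T
    M_k = truncated_svd(L.T @ W, k)         # truncate in the whitened space
    return np.linalg.solve(L.T, M_k)        # W' = L^{-T} M_k

rng = np.random.default_rng(0)
n, d_in, d_out, k = 256, 32, 24, 8  # hypothetical sizes

# Synthetic calibration activations with strongly uneven feature scales,
# standing in for a small downstream calibration set.
scales = np.logspace(0, -2, d_in)
X = rng.standard_normal((n, d_in)) * scales
W = rng.standard_normal((d_in, d_out))

W_plain = truncated_svd(W, k)
W_aware = activation_aware_lowrank(W, X, k)

# Output error on the calibration activations for each approximation.
err_plain = np.linalg.norm(X @ W - X @ W_plain)
err_aware = np.linalg.norm(X @ W - X @ W_aware)
print(f"plain SVD error: {err_plain:.3f}, activation-aware error: {err_aware:.3f}")
```

Both approximations have the same rank, and therefore the same storage cost, but the activation-aware one spends that rank on the input directions the calibration data actually exercises. This is the sense in which the emulator is downstream-aware.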

Paper Structure

This paper contains 50 sections, 16 equations, 7 figures, 16 tables, 1 algorithm.

Figures (7)

  • Figure 1: The dilemma caused by additional memory overhead during fine-tuning. (a) Users opt for a smaller 8B model, sacrificing emergent capabilities and underutilizing available hardware. (b) Using a larger 26B model requires memory exceeding the hardware limit, even with LoRA (Hu et al., 2022) and gradient checkpointing (Chen et al., 2016). (c) Our EMLoC utilizes a smaller model during fine-tuning, allowing the same budget for both training and inference.
  • Figure 2: Overview of EMLoC. Stage 1: Construct a downstream-aware lightweight emulator. Stage 2: Fine-tune the emulator via LoRA, allowing reduced memory costs. Stage 3: Update the LoRA module to compensate for the misalignment between the full model and the emulator.
  • Figure 3: LoRA correction to compensate model misalignment. To alleviate the misalignment that arises from fine-tuning the lightweight emulator but running inference on the original model, LoRA parameters are corrected via feature spaces between the emulator and the original model.
  • Figure 4: Sensitivity analysis of $\lambda$ in the LoRA correction algorithm. We plot performance on WC-VQA under different $\lambda$.
  • Figure 5: Sensitivity to calibration set size. We plot performance on WC-VQA with different numbers of calibration samples.
  • ...and 2 more figures