HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture

Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng, Shuqi Wang, Chufan Shi, Zhengwu Liu, Ngai Wong

TL;DR

This work addresses efficient finetuning and deployment of large language models on energy-efficient hardware by combining LoRA with a hybrid CIM architecture that places the pretrained weights on RRAM and the LoRA weights on SRAM. The authors introduce HaLoRA, a noise-aware training objective that minimizes the discrepancy between the optimal updates under ideal and noisy (RRAM) conditions. It does so by adding a regularization term $\mathcal{L}_{reg}=||\mathbf{A}\mathbf{A}^T||_2+||\mathbf{B}^T\mathbf{B}||_2$ to the original loss $\mathcal{L}$, weighted by $\mu$, giving the total loss $\mathcal{L}_{total}=\mathcal{L}+\mu\mathcal{L}_{reg}$. Experiments on LLaMA 3.2 1B/3B show that HaLoRA substantially improves performance under hardware noise (e.g., gains of up to $22.7$ points in average score at $\sigma=0.02$) and reduces degradation and variance compared to vanilla LoRA, supporting the approach for robust, energy-efficient LLM deployment on hybrid CIM. The work also provides a practical framework for simulating CIM non-idealities and reports that larger models tend to be more robust to noise when paired with HaLoRA. Future directions include quantization and harder reasoning tasks.
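To make the regularized objective concrete, here is a minimal NumPy sketch of $\mathcal{L}_{reg}$ and $\mathcal{L}_{total}$. It rests on several assumptions not stated in this summary: $||\cdot||_2$ is read as the spectral norm (it could also denote a Frobenius-style norm), the LoRA factors have shapes $\mathbf{A}\in\mathbb{R}^{r\times k}$ and $\mathbf{B}\in\mathbb{R}^{d\times r}$, and the function names `halora_regularizer` and `total_loss` are hypothetical.

```python
import numpy as np

def halora_regularizer(A, B):
    """L_reg = ||A A^T||_2 + ||B^T B||_2.

    Assumption: ||.||_2 is the spectral norm (largest singular value).
    A has shape (r, k) and B has shape (d, r), so both products are r x r.
    """
    return np.linalg.norm(A @ A.T, 2) + np.linalg.norm(B.T @ B, 2)

def total_loss(task_loss, A, B, mu):
    """L_total = L + mu * L_reg, where task_loss is the original loss L."""
    return task_loss + mu * halora_regularizer(A, B)

# Small hand-checkable example (illustrative values only).
A = np.array([[1.0, 0.0],
              [0.0, 2.0]])          # r=2, k=2 -> A A^T = diag(1, 4)
B = np.array([[3.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])          # d=3, r=2 -> B^T B = diag(9, 1)

reg = halora_regularizer(A, B)      # 4 + 9
loss = total_loss(1.0, A, B, mu=0.1)
```

In an actual training loop the same expression would be written with a differentiable framework so that gradients flow into $\mathbf{A}$ and $\mathbf{B}$; the NumPy version only illustrates the arithmetic.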

Abstract

Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method for adapting large language models (LLMs) to downstream tasks. In this paper, we first propose to deploy LoRA-finetuned LLMs on a hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights on RRAM and LoRA on SRAM). To address the performance degradation caused by RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method, which aims to train a LoRA branch that is both robust and accurate by aligning the training objectives under ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving an improvement of up to 22.7 in average score while maintaining robustness at various noise levels.
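The hybrid deployment described above can be sketched as a single linear layer whose pretrained weights are perturbed while the LoRA branch is not. This is only an illustration under stated assumptions: the noise is modeled as additive zero-mean Gaussian with standard deviation `sigma` applied elementwise to the pretrained weights, the SRAM-resident LoRA branch is treated as noise-free, and `hybrid_forward` is a hypothetical name. The paper's actual noise model may differ.

```python
import numpy as np

def hybrid_forward(x, W0, A, B, sigma=0.0, rng=None):
    """Simulate y = (W0 + noise) x + B (A x).

    W0: pretrained weights (d, k), assumed to sit on RRAM and receive noise.
    A, B: LoRA factors (r, k) and (d, r), assumed to sit on SRAM, noise-free.
    sigma: std of the assumed additive Gaussian noise; 0 gives the ideal case.
    """
    if sigma > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        W_rram = W0 + rng.normal(0.0, sigma, size=W0.shape)
    else:
        W_rram = W0
    return W_rram @ x + B @ (A @ x)

# Illustrative shapes and random values only.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 3))
A = rng.normal(size=(2, 3))
B = rng.normal(size=(4, 2))
x = rng.normal(size=3)

y_ideal = hybrid_forward(x, W0, A, B)
y_noisy = hybrid_forward(x, W0, A, B, sigma=0.02,
                         rng=np.random.default_rng(1))
```

Comparing `y_ideal` and `y_noisy` over many sampled noise draws gives a simple way to estimate how much a given LoRA branch degrades under weight noise.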

Paper Structure

This paper contains 14 sections, 11 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: An inference example comparing noise-free and noisy LoRA-finetuned LLMs. The non-ideality of RRAM adds noise to the pretrained weights and thus hurts performance.
  • Figure 2: The training and deployment stages of the proposed HaLoRA. (a) During training, HaLoRA adds a regularization term to the loss, computed with sampled noise, to enhance model robustness. (b) During deployment, the finetuned LLM is mapped onto a hybrid CIM architecture formed by RRAM-based and SRAM-based CIM macros, leveraging their respective advantages.
  • Figure 3: The performance of HaLoRA with different values of $\mu$ and vanilla LoRA on the OBQA and SIQA datasets.