HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture

Taiqiang Wu; Chenchen Ding; Wenyong Zhou; Yuxin Cheng; Xincheng Feng; Shuqi Wang; Chufan Shi; Zhengwu Liu; Ngai Wong

TL;DR

This work addresses efficient finetuning and deployment of large language models on energy-efficient hardware by combining LoRA with a hybrid CIM architecture that places the pretrained weights on RRAM and the LoRA weights on SRAM. The authors introduce HaLoRA, a noise-aware training objective that minimizes the discrepancy between the optimal updates under ideal and noisy (RRAM) conditions via a regularization term $\mathcal{L}_{reg}=\|\mathbf{A}\mathbf{A}^T\|_2+\|\mathbf{B}^T\mathbf{B}\|_2$ added to the total loss $\mathcal{L}_{total}=\mathcal{L}+\mu\mathcal{L}_{reg}$. Experiments on LLaMA 3.2 1B and 3B show that HaLoRA substantially improves performance under hardware noise (e.g., gains of up to $22.7$ points at $\sigma=0.02$) and reduces both degradation and variance compared to vanilla LoRA, supporting the approach for robust, energy-efficient LLM deployment on hybrid CIM. The work also provides a practical framework for simulating CIM non-idealities and reports that larger models tend to be more robust to noise when paired with HaLoRA. Future directions include quantization and harder reasoning tasks.
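As a rough illustration of the regularized objective above, the following NumPy sketch computes $\mathcal{L}_{reg}$ and $\mathcal{L}_{total}$. It rests on assumptions that the summary does not state: the usual LoRA shape convention ($\mathbf{A}$ is $r \times d_{in}$, $\mathbf{B}$ is $d_{out} \times r$), and a reading of $\|\cdot\|_2$ as the spectral norm. The function names `halora_reg` and `total_loss` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def halora_reg(A: np.ndarray, B: np.ndarray) -> float:
    """Sketch of L_reg = ||A A^T||_2 + ||B^T B||_2.

    Assumed shapes (standard LoRA convention, not confirmed by the source):
    A is (r, d_in), B is (d_out, r), so both products are (r, r).
    Assumption: ||.||_2 is interpreted as the spectral norm.
    """
    norm_a = np.linalg.norm(A @ A.T, ord=2)  # largest singular value of A A^T
    norm_b = np.linalg.norm(B.T @ B, ord=2)  # largest singular value of B^T B
    return float(norm_a + norm_b)


def total_loss(task_loss: float, A: np.ndarray, B: np.ndarray, mu: float) -> float:
    """L_total = L + mu * L_reg, with mu a weighting hyperparameter."""
    return task_loss + mu * halora_reg(A, B)
```

If the paper intends the Frobenius norm instead, only the `ord` argument would change (`ord="fro"`); the structure of the total loss stays the same.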

Abstract

Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method for adapting large language models (LLMs) to downstream tasks. In this paper, we first propose deploying LoRA-finetuned LLMs on a hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights on RRAM and LoRA weights on SRAM). To address the performance degradation caused by RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method, which aims to train a LoRA branch that is both robust and accurate by aligning the training objectives under ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving an improvement of up to 22.7 points in average score while maintaining robustness at various noise levels.
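The hybrid deployment described in the abstract can be simulated in software along the following lines. This is a minimal sketch, not the authors' simulation framework: it assumes that RRAM noise can be approximated by additive zero-mean Gaussian noise with standard deviation `sigma` on the pretrained weights, and that the LoRA branch on SRAM is noise-free. The function name `hybrid_forward` and the exact noise model are illustrative assumptions.

```python
import numpy as np


def hybrid_forward(
    x: np.ndarray,
    W: np.ndarray,
    A: np.ndarray,
    B: np.ndarray,
    sigma: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """Forward pass of one linear layer on a simulated hybrid CIM device.

    x: (batch, d_in) inputs
    W: (d_out, d_in) pretrained weights, assumed stored on noisy RRAM
    A: (r, d_in), B: (d_out, r) LoRA factors, assumed stored on noise-free SRAM
    sigma: standard deviation of the assumed additive Gaussian weight noise
    """
    # Assumption: RRAM non-idealities are modeled as additive Gaussian noise on W.
    W_noisy = W + rng.normal(loc=0.0, scale=sigma, size=W.shape)
    # The LoRA update B @ A is applied without noise.
    return x @ W_noisy.T + x @ (B @ A).T
```

Running the same inputs at several values of `sigma` and with several random seeds gives a simple way to observe how much the layer output drifts from the ideal case `sigma = 0`.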
