FinLoRA: Finetuning Quantized Financial Large Language Models Using Low-Rank Adaptation

Dannong Wang, Daniel Kim, Bo Jin, Xingjian Zhao, Tianfan Fu, Steve Yang, Xiao-Yang Liu

TL;DR

The paper tackles the challenge of privately finetuning financial LLMs under memory constraints by introducing FinLoRA, which combines Quantized Low-Rank Adaptation (QLoRA) with data and pipeline parallelism. By expressing weight updates as $W = W_0 + \Delta W$ with $\Delta W = BA$ and applying $8$-bit or $4$-bit quantization, the method substantially reduces memory usage while preserving performance; finetuning and inference are further accelerated with Distributed Data Parallel and pipeline parallelism. Empirical results on sentiment analysis, NER, headlines, and XBRL tasks show accuracy improvements of up to $48\%$ on average over baselines, with 4-bit/4-bit configurations achieving memory savings close to 8-bit setups and even enabling 70B-scale models to run locally. The work demonstrates practical, scalable local FinLLM deployment on commodity GPUs, offering a path for financial institutions to customize models without centralized access to data. The update $W = W_0 + \Delta W$ with $\Delta W = BA$ and the forward pass $y = W_0 x + BAx$ capture the core finetuning and inference computation, underscoring the memory efficiency of QLoRA.
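To make the low-rank update concrete, here is a minimal NumPy sketch of the forward pass $y = W_0 x + BAx$. The dimensions, the rank `r`, and the helper name `lora_forward` are illustrative assumptions, not values or code from the paper. The zero initialization of `B` follows the common LoRA convention, so the adapted model starts out identical to the pretrained one.

```python
import numpy as np

def lora_forward(x, W0, A, B):
    """Compute y = W0 x + B (A x) without forming the full d_out x d_in matrix BA."""
    return W0 @ x + B @ (A @ x)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4                  # assumed sizes; r is the adapter rank

W0 = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init so BA = 0 at the start
x = rng.standard_normal(d_in)

y = lora_forward(x, W0, A, B)

# Trainable parameters: full update vs. low-rank factors.
full_params = d_out * d_in                   # size of a dense Delta W
lora_params = r * (d_in + d_out)             # size of A plus B
```

With these assumed sizes the adapter trains `r * (d_in + d_out)` parameters instead of `d_out * d_in`, which is where the memory saving of the low-rank factorization comes from; quantizing the frozen `W0` reduces memory further.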

Abstract

Finetuned large language models (LLMs) have shown remarkable performance in financial tasks, such as sentiment analysis and information retrieval. Due to privacy concerns, finetuning and deploying Financial LLMs (FinLLMs) locally are crucial for institutions. However, finetuning FinLLMs poses challenges including GPU memory constraints and long input sequences. In this paper, we employ quantized low-rank adaptation (QLoRA) to finetune FinLLMs, which leverages low-rank matrix decomposition and quantization techniques to significantly reduce computational requirements while maintaining high model performance. We also employ data and pipeline parallelism to enable local finetuning using cost-effective, widely accessible GPUs. Experiments on financial datasets demonstrate that our method achieves substantial improvements in accuracy, GPU memory usage, and time efficiency, underscoring the potential of low-rank methods for scalable and resource-efficient LLM finetuning.
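The quantization side of the method can be illustrated with a simple symmetric absmax scheme that stores a weight tensor as 8-bit integers plus one scale factor. This is a sketch chosen for clarity and is not the quantizer used in the paper: QLoRA applies block-wise quantization with a 4-bit NormalFloat data type. The function names below are hypothetical.

```python
import numpy as np

def quantize_absmax_int8(w):
    """Map a float tensor to int8 using a single per-tensor scale."""
    scale = float(np.max(np.abs(w))) / 127.0
    if scale == 0.0:                          # all-zero tensor: any scale works
        scale = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)   # stand-in for a frozen weight

q, scale = quantize_absmax_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))               # bounded by about scale / 2
```

Here the int8 copy occupies a quarter of the bytes of the float32 original, and the rounding error per entry is at most about half a quantization step. In a QLoRA-style setup only the frozen base weights are stored this way, while the low-rank adapter factors stay in higher precision.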

Paper Structure

This paper contains 25 sections, 2 equations, 6 tables.