FinGPT: Democratizing Internet-scale Data for Financial Large Language Models

Xiao-Yang Liu; Guoxuan Wang; Hongyang Yang; Daochen Zha

TL;DR

FinGPT addresses the shortage of open finance-domain data for financial LLMs (FinLLMs) with an open-source, data-centric framework that automates real-time data collection from 34 Internet sources. The framework combines a four-layer architecture with a lightweight fine-tuning workflow based on LoRA/QLoRA and Reinforcement Learning with Stock Prices (RLSP). In contrast to BloombergGPT, it emphasizes cost efficiency, openness, and rapid adaptation to market changes. The paper demonstrates three applications (robo-advisor, sentiment-based quantitative trading, and low-code development) and reports empirical observations on data curation quality, labeling strategies, and fine-tuning performance. By enabling access to Internet-scale financial data and low-cost model adaptation, FinGPT aims to accelerate open-finance research and practical deployment of FinLLMs.

Abstract

Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like text, which may revolutionize the finance industry. However, existing LLMs often fall short in the financial field, mainly because of the disparities between general text data and financial text data. Unfortunately, only a limited number of financial text datasets are available, and BloombergGPT, the first financial LLM (FinLLM), is closed-source (only the training logs were released). In light of this, we aim to democratize Internet-scale financial data for LLMs, which is an open challenge due to diverse data sources, low signal-to-noise ratio, and high time-validity. To address these challenges, we introduce an open-source, data-centric framework, Financial Generative Pre-trained Transformer (FinGPT), that automates the collection and curation of real-time financial data from 34 diverse sources on the Internet, providing researchers and practitioners with accessible and transparent resources to develop their FinLLMs. Additionally, we propose a simple yet effective strategy for fine-tuning FinLLMs using the inherent feedback from the market, dubbed Reinforcement Learning with Stock Prices (RLSP). We also adopt the Low-rank Adaptation (LoRA, QLoRA) method, which enables users to customize their own FinLLMs from general-purpose LLMs at a low cost. Finally, we showcase several FinGPT applications, including robo-advisor, sentiment analysis for algorithmic trading, and low-code development. FinGPT aims to democratize FinLLMs, stimulate innovation, and unlock new opportunities in open finance. The code has been open-sourced.
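
To make the low-cost claim behind low-rank adaptation concrete, the following is a minimal sketch of the general LoRA idea, not FinGPT's own implementation. A frozen weight matrix W is augmented with a trainable low-rank product B A scaled by alpha / rank, so only rank * (d_out + d_in) parameters are trained instead of d_out * d_in. The function names, the 4096 x 4096 layer size, and the rank and alpha values are illustrative assumptions and do not come from the paper.

```python
# Minimal pure-Python sketch of the LoRA idea (illustrative only; helper
# names and dimensions are assumptions, not taken from the FinGPT code).

def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in              # updating every entry of W
    lora = rank * (d_out + d_in)     # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def effective_weight(W, B, A, alpha, rank):
    """Return W + (alpha / rank) * B A, the weight the adapted layer applies."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection adapted with rank 8.
    full, lora = lora_param_counts(4096, 4096, 8)
    print(f"full: {full}, lora: {lora}, ratio: {lora / full:.4%}")

    # Tiny 2 x 2 example with a rank-1 update.
    W = [[1.0, 2.0], [3.0, 4.0]]
    B = [[1.0], [0.0]]
    A = [[0.5, 0.5]]
    print(effective_weight(W, B, A, alpha=2, rank=1))
```

In the standard LoRA setup, B is initialized to zero, so the adapted layer starts out identical to the pretrained one and only drifts from it as the two small matrices are trained.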

Paper Structure (67 sections, 4 equations, 2 figures, 8 tables)

Introduction
Related Work
Data-centric FinGPT Framework for FinLLMs
Challenges of Training FinLLMs
Overview of FinGPT Framework
Proprietary Model BloombergGPT
Democratizing Internet-scale Financial Data
Financial Data Sources
Data Interface
Automated Real-Time Data Curation Pipeline
Data Cleaning
Document Filtering
Tokenization
Lightweight Adaptation of General-Purpose LLMs to FinLLMs
Demonstrative Applications of FinGPT
...and 52 more sections

Figures (2)

Figure 1: Four-layer design of the FinGPT framework. The Data Source layer orchestrates the acquisition of extensive financial data from various online sources, including news websites, social media platforms, company filings, and research datasets. The Data Curation layer focuses on the real-time processing of the text data to filter noise. The LLM layer encompasses various LLMs and fine-tuning methodologies, with a priority on lightweight adaptation, to keep the model updated and pertinent. The Application layer is designed to demonstrate the practical applicability of FinGPT.
Figure 2: Financial data sources of FinGPT, including 19 news sources, 8 social media sources, 3 filing sources, and 4 academic datasets.
