CFGPT: Chinese Financial Assistant with Large Language Model

Jiangtong Li, Yuxuan Bian, Guoxuan Wang, Yang Lei, Dawei Cheng, Zhijun Ding, Changjun Jiang

TL;DR

CFGPT tackles the need for domain-specific Chinese financial LLMs by introducing CFData for large-scale pretraining and supervised fine-tuning, and a two-stage CFLLM training regime based on InternLM-7B. It also presents CFAPP, a deployment framework that integrates LLMs with retrieval, structured reasoning, and multi-format outputs to handle real-world financial tasks. The approach yields improved capabilities on six financial tasks after continued pretraining and supervised fine-tuning, highlighting the value of data-centric FinLLMs for Chinese finance. The open-source release and deployment framework aim to accelerate research and practical adoption of domain-specific financial LLMs.

Abstract

Large language models (LLMs) have demonstrated great potential in natural language processing tasks within the financial domain. In this work, we present a Chinese Financial Generative Pre-trained Transformer framework, named CFGPT, which includes a dataset (CFData) for pre-training and supervised fine-tuning, a financial LLM (CFLLM) to adeptly manage financial texts, and a deployment framework (CFAPP) designed to navigate real-world financial applications. CFData comprises both a pre-training dataset and a supervised fine-tuning dataset. The pre-training dataset collates Chinese financial data and analytics, alongside a smaller subset of general-purpose text, with 584M documents and 141B tokens in total. The supervised fine-tuning dataset is tailored for six distinct financial tasks, embodying various facets of financial analysis and decision-making, with 1.5M instruction pairs and 1.5B tokens in total. CFLLM, which is based on InternLM-7B to balance model capability and size, is trained on CFData in two stages: continued pre-training and supervised fine-tuning. CFAPP is centered on LLMs and augmented with additional modules to ensure multifaceted functionality in real-world applications. Our code is released at https://github.com/TongjiFinLab/CFGPT.

Paper Structure

This paper contains 29 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: The preprocessing steps for each sub-dataset in the CFData pre-training dataset.
  • Figure 2: The working pipeline of the CFAPP framework.
  • Figure 3: The CFAPP framework.
  • Figure 4: An example of Content Summary, summarizing the shareholding relationships and executive chart based on corporate announcements.
  • Figure 5: An example of Causal Reasoning, answering an open-domain question.