BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark

Dakuan Lu, Hengkui Wu, Jiaqing Liang, Yipei Xu, Qianyu He, Yipeng Geng, Mengkun Han, Yingsi Xin, Yanghua Xiao

TL;DR

This work addresses the lack of resources for Chinese financial NLP by delivering FinT5, a large-scale domain-specific pre-trained model trained on FinCorpus (~300GB) and enhanced with a knowledge-masking approach (KETM). It also introduces CFLEB, a Chinese financial NLP benchmark with practical tasks and leaderboards that support fair comparison between models. FinT5, especially the 1B-parameter large version and the KE variant, outperforms existing Chinese financial PLMs across multiple tasks, demonstrating the value of domain-focused pre-training and knowledge integration. The release of FinCorpus, FinT5, and CFLEB provides a substantial, openly available foundation for research and real-world applications in Chinese finance.

Abstract

To advance Chinese financial natural language processing (NLP), we introduce BBT-FinT5, a new Chinese financial pre-trained language model based on the T5 model. To support this effort, we have built BBT-FinCorpus, a large-scale financial corpus with approximately 300GB of raw text from four different sources. In general-domain NLP, comprehensive benchmarks like GLUE and SuperGLUE have driven significant advances in language model pre-training by enabling head-to-head comparisons among models. Drawing inspiration from these benchmarks, we propose BBT-CFLEB, a Chinese Financial Language understanding and generation Evaluation Benchmark, which includes six datasets covering both understanding and generation tasks. Our aim is to facilitate NLP research in the Chinese financial domain. Our model, corpus and benchmark are released at https://github.com/ssymmetry/BBT-FinCUGE-Applications. Our work is part of Big Bang Transformer (BBT), a large-scale pre-trained language model project.

Paper Structure (27 sections, 1 figure, 4 tables)