Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

Xiaojun Wu, Junxi Liu, Huanyi Su, Zhouchi Lin, Yiyan Qi, Chengjin Xu, Jiajun Su, Jiajie Zhong, Fuwei Wang, Saizhuo Wang, Fengrui Hua, Jia Li, Jian Guo

TL;DR

The paper introduces Golden Touchstone, the first comprehensive bilingual benchmark for English-Chinese financial NLP, spanning eight task types and 22 datasets to evaluate both understanding and generation capabilities of financial LLMs. It pairs this benchmark with Touchstone-GPT, an open-source financial LLM trained via continual pre-training and instruction tuning, and provides detailed experimental results across GPT-4o, Llama-3, Qwen-2, FinGPT, FinMA, and DISC-FinLLM. The findings reveal strong performance in sentiment analysis and certain structured tasks but persistent gaps in relation extraction, summarization, and stock-movement prediction, highlighting the need for targeted domain adaptation and better bilingual alignment. The work advances the field by offering a practical, extensible bilingual evaluation framework and releasing open-source datasets and tools to foster ongoing development of financial LLMs in multilingual settings.

Abstract

As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and inadequate adaptability for LLM evaluation. To address these limitations, we introduce Golden Touchstone, a comprehensive bilingual benchmark for financial LLMs, encompassing eight core financial NLP tasks in both Chinese and English. Developed from extensive open-source data collection and industry-specific demands, this benchmark thoroughly assesses models' language understanding and generation capabilities. Through comparative analysis of major models such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research provides a practical evaluation tool for financial LLMs and guides future development and optimization. The source code for Golden Touchstone and the model weights of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone.

Paper Structure

This paper contains 18 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Financial large language models are designed to perform specialized tasks such as financial sentiment analysis, content analysis, stock movement prediction, and financial analyst level question answering by interpreting and processing structured instructions and various input data to generate precise outputs.
  • Figure 2: Financial NLP tasks are categorized along two dimensions: task types, divided into financial NLU (Natural Language Understanding) and financial NLG (Natural Language Generation), and language, categorized as English and Chinese. We organized the collected high-quality datasets along these axes.
  • Figure 3: Comparison of performance from the perspective of models. Each subplot represents the performance of a model on both English and Chinese tasks. The bars indicate the model's performance on each task, while the dashed gray line represents the average performance across all models for that task.
  • Figure 4: Comparison of performance from the perspective of tasks, illustrating average performance for English and Chinese tasks respectively.