FLAME: Financial Large-Language Model Assessment and Metrics Evaluation

Jiayu Guo; Yu Guo; Martha Li; Songtao Tan

TL;DR

FLAME is a two-benchmark framework for evaluating Chinese financial LLMs. FLAME-Cer is a financial qualification certification benchmark with about 16,000 manually reviewed questions across 14 certifications. FLAME-Sce is a financial scenario application benchmark with more than 5,000 questions spanning 10 core, 21 secondary, and about 100 tertiary tasks. The framework defines a multi-dimensional scoring system with scenario-specific weights and a two-step evaluation process, so that assessment covers both knowledge mastery and practical applicability. Experiments on six LLMs show Baichuan4-Finance leading on most tasks, and the detailed results and case studies illustrate strengths and limitations across knowledge, compliance, and real-world applications. The framework aims to advance professional, regulator-aligned, and industry-relevant evaluation of financial LLMs in Chinese contexts, and to support model development and deployment decisions with timely updates. The key methodological contributions are the hierarchical benchmark design, the dimension-weighted scoring, and the extensive, manually curated content for domain-specific evaluation.
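As a rough illustration of what dimension-weighted scoring with scenario-specific weights can look like, the following minimal Python sketch combines per-dimension scores into a single weighted average. It is an assumption-laden sketch, not the FLAME implementation: the function name `weighted_score`, the dimension names, the scenario names, and every weight and score below are hypothetical and do not come from the paper.

```python
from typing import Dict


def weighted_score(dimension_scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Return the weighted average of per-dimension scores.

    Both dictionaries must use the same dimension names. Weights are
    normalized by their sum, so they do not need to add up to 1.
    """
    if set(dimension_scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(dimension_scores[d] * weights[d] for d in weights)
    return weighted_sum / total_weight


# Hypothetical scenario-specific weights (illustrative values only).
scenario_weights = {
    "compliance_review": {"accuracy": 0.5, "compliance": 0.4, "fluency": 0.1},
    "customer_service": {"accuracy": 0.3, "compliance": 0.2, "fluency": 0.5},
}

# Hypothetical per-dimension scores for one model answer.
answer_scores = {"accuracy": 4.0, "compliance": 5.0, "fluency": 3.0}

for scenario, weights in scenario_weights.items():
    print(scenario, round(weighted_score(answer_scores, weights), 2))
```

The same answer receives a different overall score under each scenario because the weights shift emphasis between dimensions, which is the behavior that scenario-specific weighting is meant to capture.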

Abstract

LLMs have revolutionized NLP and demonstrated potential across diverse domains. A growing number of financial LLMs have been introduced for finance-specific tasks, yet comprehensively assessing their value remains challenging. In this paper, we introduce FLAME, a comprehensive evaluation system for financial LLMs in Chinese, which includes two core evaluation benchmarks: FLAME-Cer and FLAME-Sce. FLAME-Cer covers 14 types of authoritative financial certifications, including CPA, CFA, and FRM, with a total of approximately 16,000 carefully selected questions. All questions have been manually reviewed to ensure accuracy and representativeness. FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks. We evaluate 6 representative LLMs, namely GPT-4o, GLM-4, ERNIE-4.0, Qwen2.5, XuanYuan3, and the latest Baichuan4-Finance, and find that Baichuan4-Finance outperforms the other LLMs on most tasks. By establishing a comprehensive and professional evaluation system, FLAME facilitates the advancement of financial LLMs in Chinese contexts. Instructions for participating in the evaluation are available on GitHub: https://github.com/FLAME-ruc/FLAME.
