UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

Yuzhe Yang, Yifei Zhang, Yan Hu, Yilin Guo, Ruoli Gan, Yueru He, Mingcong Lei, Xiao Zhang, Haining Wang, Qianqian Xie, Jimin Huang, Honghai Yu, Benyou Wang

TL;DR

The UCFE benchmark proposes a user-centric framework to evaluate LLMs on real-world financial tasks by combining dynamic, multi-turn interactions with human expert judgments. It builds a dataset from a large-scale user survey (804 participants) and 17 task types (330 data points) to assess 11 LLM services using an LLM-as-Judge approach, achieving strong alignment with human preferences ($r = 0.78$). The evaluation employs Elo-based model comparisons, cross-checks with multiple evaluators, and a detailed analysis of case studies, demonstrating that domain-specialized, mid-sized models can outperform backbone models while maintaining efficiency. This framework advances practical AI deployment in finance by prioritizing user needs, explainability, and robust human-AI alignment beyond traditional task-specific metrics.

Abstract

This paper introduces UCFE, the User-Centric Financial Expertise benchmark, a framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. The UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. First, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Second, based on this feedback, we created a dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 11 LLM services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, supporting the effectiveness of the UCFE dataset and our evaluation approach. The UCFE benchmark not only reveals the potential of LLMs in the financial domain but also provides a robust framework for assessing their performance and user satisfaction.

Paper Structure

This paper contains 32 sections, 3 equations, 24 figures, 5 tables.

Figures (24)

  • Figure 1: Overview framework of the UCFE Benchmark.
  • Figure 2: The visualization displays the top 25 most common root verbs (inner circle) and their top 4 associated direct noun objects (outer circle) extracted from the provided texts.
  • Figure 3: Distribution of test and evaluation input lengths for the datasets.
  • Figure 4: The evaluation pipeline of the UCFE Benchmark involves the following steps: ① selecting the model and task, ② generating dialogues between the user and AI assistant via a user simulator, ③ creating evaluation prompts based on source information to assess model performance, ④ pairwise comparison of dialogue outputs by evaluators, aligned with human expert judgments, and ⑤ computing Elo scores based on win-loss outcomes.
  • Figure 5: Comparison of model performance on UCFE benchmark across three evaluators.
  • ...and 19 more figures
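
Step ⑤ of the pipeline in Figure 4 turns pairwise win-loss outcomes into Elo scores. The following is a minimal sketch of a standard sequential Elo update, not the authors' implementation. The starting rating of 1000, the K-factor of 32, the model names, and the match list are all illustrative assumptions; the paper's actual settings are not given in this section.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one win-loss outcome in place. k=32 is an assumed K-factor."""
    e_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_win)  # the winner's actual score is 1
    ratings[winner] += delta
    ratings[loser] -= delta    # zero-sum: the loser gives up the same amount


def elo_from_matches(matches, initial: float = 1000.0, k: float = 32.0) -> dict:
    """Process (winner, loser) pairs in order; initial=1000 is an assumption."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in matches:
        update_elo(ratings, winner, loser, k)
    return dict(ratings)


# Hypothetical judge verdicts: each tuple is (winner, loser) for one dialogue pair.
matches = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_c", "model_b"),
]
ratings = elo_from_matches(matches)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Sequential Elo depends on the order in which matches are processed, so in practice results are often averaged over several shuffled orderings; whether UCFE does so is not stated here.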