Extract, Match, and Score: An Evaluation Paradigm for Long Question-context-answer Triplets in Financial Analysis

Bo Hu, Han Yuan, Vlad Pandelea, Wuqiong Luo, Yingzhu Zhao, Zheng Ma

TL;DR

The paper tackles the challenge of evaluating long question-context-answer triplets in financial analysis, where traditional metrics falter. It introduces the EMS framework (Extract, Match, Score), a saliency-point-based pipeline that decomposes long responses into detailed claims, aligns reference and candidate points, and assigns soft alignment scores to produce EMS-Recall, EMS-Precision, and EMS-F1. Using a self-constructed financial QA dataset based on earnings call transcripts from the top ten S&P 500 constituents, the authors show that both EMS and the RAGChecker baseline provide more nuanced, model-size-sensitive assessments than conventional metrics such as BLEU, ROUGE, and BERTScore. The findings position EMS as a flexible, principled tool for assessing long-form financial analyses, with practical implications for evaluating and improving real-world LLM deployments in finance.
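The aggregation step described above can be illustrated with a minimal sketch. It assumes, since this summary does not give the paper's formulas, that each reference saliency point and each candidate saliency point already carries a soft alignment score in [0, 1], that EMS-Recall is the mean score over reference points, that EMS-Precision is the mean score over candidate points, and that EMS-F1 is their harmonic mean. The function name `ems_scores` and its arguments are hypothetical, not taken from the paper.

```python
from typing import Dict, Sequence


def ems_scores(
    ref_point_scores: Sequence[float],
    cand_point_scores: Sequence[float],
) -> Dict[str, float]:
    """Aggregate per-point soft alignment scores into recall, precision, F1.

    ref_point_scores:  one score in [0, 1] per reference saliency point,
                       i.e. how well the candidate answer covers that point.
    cand_point_scores: one score in [0, 1] per candidate saliency point,
                       i.e. how well the reference supports that point.
    The averaging scheme is an assumption made for illustration.
    """
    # Recall: average coverage of the reference points.
    recall = (
        sum(ref_point_scores) / len(ref_point_scores) if ref_point_scores else 0.0
    )
    # Precision: average support for the candidate points.
    precision = (
        sum(cand_point_scores) / len(cand_point_scores) if cand_point_scores else 0.0
    )
    # F1: harmonic mean, defined as 0 when both components are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


if __name__ == "__main__":
    # Three reference points (fully, partially, not covered) and
    # two candidate points (fully and partially supported).
    print(ems_scores([1.0, 0.5, 0.0], [1.0, 0.5]))
```

Unlike exact-match counting, the soft scores let a partially covered point contribute a fraction of a hit, which is what makes the resulting recall and precision sensitive to how completely a long answer addresses each point.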

Abstract

The rapid advancement of large language models (LLMs) has sparked widespread adoption across diverse applications, making robust evaluation frameworks crucial for assessing their performance. While conventional evaluation metrics remain applicable for shorter texts, their efficacy diminishes when evaluating the quality of long-form answers. This limitation is particularly critical in real-world scenarios involving extended questions, extensive context, and long-form answers, such as financial analysis or regulatory compliance. In this paper, we use a practical financial use case to illustrate applications that handle "long question-context-answer triplets". We construct a real-world financial dataset comprising long triplets and demonstrate the inadequacies of traditional metrics. To address this, we propose an effective Extract, Match, and Score (EMS) evaluation approach tailored to the complexities of long-form LLM outputs, providing practitioners with a reliable methodology for assessing LLM performance in complex real-world scenarios.

Paper Structure

This paper contains 18 sections, 9 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: An example of saliency point extraction from a long-form answer.
  • Figure 2: Illustration of the matching and scoring procedure in the EMS evaluation pipeline. EMS-Recall and EMS-Precision are then computed by aggregating saliency-point-level scores.
  • Figure 3: The prompt used to generate the individual answers from GPT-4o and Mistral Large.
  • Figure 4: The prompt used to form the final answer by combining the answers from GPT-4o and Mistral Large.
  • Figure 5: The prompt used to extract saliency points from candidate answers.
  • ...and 2 more figures