FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

Yitao Long, Tiansheng Hu, Yilun Zhao, Arman Cohan, Chen Zhao

TL;DR

FinLFQA tackles hallucination in financial long-form QA by introducing a multi-faceted attribution benchmark and an automated, fine-grained evaluation framework. It formalizes the task as clause-level statements, each carrying supporting evidence, intermediate Python reasoning, and links to domain knowledge, and evaluates eight LLMs under three generation paradigms. The results show that end-to-end generation performs comparably to post-hoc approaches, that iterative refinement yields limited gains without external feedback, and that domain-specific guidance significantly improves outcomes. The benchmark highlights the importance of precise numerical reasoning and knowledge grounding in finance, with GPT-4o achieving the strongest overall performance and open-source models approaching parity in several dimensions. One limitation is the two-company setup; broader data would be needed to further stress-test attribution in high-stakes financial QA.

Abstract

Large Language Models (LLMs) frequently hallucinate when answering long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution for LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important for distinguishing model capabilities, that end-to-end generation achieves performance comparable to post-hoc approaches, and that iterative refinement only helps when guided by external feedback.

Paper Structure

This paper contains 37 sections, 1 equation, 8 figures, 8 tables.

Figures (8)

  • Figure 1: (Left) Compared with the previous annotated long-form question answering dataset (gao-etal-2023-enabling), FinLFQA features clause-level attribution, generation with knowledge retrieval, and multi-faceted attribution. (Right) Overview of FinLFQA. The input consists of: (1) context: financial report paragraphs from two companies, (2) a question, and (3) a list of professional knowledge entries that may help in answering the financial question. The outputs include: (a) an expert-written answer to the question by our annotators, and (b) clause-level attributions, which cover three aspects: Evidence (paragraph indices supporting the answer), Knowledge (entries from the provided knowledge list used), and Code (a Python snippet used to compute the numerical result when the answer involves calculations).
  • Figure 2: Overview of the three-stage process for FinLFQA construction. (1) Report Selection: We select company pairs based on their SIC codes and obtain their financial reports for the same fiscal quarter. (2) Question & Answer Annotation: We then identify key numerical content from both financial reports. Given this information, the annotators craft calculation-based questions requiring cross-company and multi-source reasoning and provide detailed, step-by-step answers citing relevant paragraphs. (3) Attribution Annotation: Finance experts verify and split answers into evidence-backed clauses, annotate relevant professional financial concepts from a knowledge base, and translate verified calculations into structured Python functions for reproducibility and validation.
  • Figure 3: Prompt of post-hoc answer generation.
  • Figure 4: Prompt of post-hoc attribution generation.
  • Figure 5: Prompt of end-to-end generation.
  • ...and 3 more figures
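
The input/output structure described in the Figure 1 caption can be sketched as a small data model. This is only an illustration: the class names, field names, the `check_attributions` helper, the 0-based paragraph indexing, and all sample values below are assumptions made for this sketch, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClauseAttribution:
    """One answer clause with its three attribution facets (hypothetical schema)."""
    clause: str
    evidence: List[int] = field(default_factory=list)   # paragraph indices (assumed 0-based)
    knowledge: List[str] = field(default_factory=list)  # entries taken from the knowledge list
    code: Optional[str] = None                          # Python snippet, only for calculations


@dataclass
class FinLFQAExample:
    """Inputs (context, question, knowledge list) and outputs (answer, attributions)."""
    context: List[str]
    question: str
    knowledge_list: List[str]
    answer: str
    attributions: List[ClauseAttribution]


def check_attributions(example: FinLFQAExample) -> List[str]:
    """Report evidence indices outside the context and knowledge entries not in the list."""
    problems = []
    for i, attr in enumerate(example.attributions):
        for idx in attr.evidence:
            if not 0 <= idx < len(example.context):
                problems.append(f"clause {i}: evidence index {idx} out of range")
        for entry in attr.knowledge:
            if entry not in example.knowledge_list:
                problems.append(f"clause {i}: unknown knowledge entry {entry!r}")
    return problems


# Made-up example values, for illustration only.
example = FinLFQAExample(
    context=[
        "Company A reported revenue of 120 and net income of 30.",
        "Company B reported revenue of 200 and net income of 40.",
    ],
    question="Which company has the higher net profit margin?",
    knowledge_list=["Net profit margin = net income / revenue"],
    answer="Company A has the higher net profit margin (25% vs. 20%).",
    attributions=[
        ClauseAttribution(
            clause="Company A has the higher net profit margin (25% vs. 20%).",
            evidence=[0, 1],
            knowledge=["Net profit margin = net income / revenue"],
            code="def solution():\n    return 30 / 120, 40 / 200\n",
        )
    ],
)

print(check_attributions(example))  # prints []
```

Keeping evidence, knowledge, and code as separate fields on each clause mirrors the three attribution aspects named in the caption and makes simple structural checks, such as the one above, straightforward.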