Characterizing Multimodal Long-form Summarization: A Case Study on Financial Reports

Tianyu Cao, Natraj Raman, Danial Dervovic, Chenhao Tan

TL;DR

The paper examines how large language models handle multimodal, long-form summarization of financial reports, using 1,000 random 10-K MD&A documents. It introduces a computational framework for analyzing extractiveness, information sourcing, and numeric usage across models (Claude 2.0/2.1, GPT-4/3.5, and Cohere), and presents a detailed empirical study that includes a taxonomy of numeric hallucinations and prompt-engineering strategies. Key findings: a substantial share of summary content is extractive (30–40%); both Claude 2 and GPT-4 show a strong position bias toward the beginning of the report, which disappears for Claude after the input is shuffled but persists for GPT-4; and Claude 2 makes better use of tabular numbers than GPT-4. The overall hallucination rate is low, and prompt engineering, including CoT prompts, improves GPT-4's use of numbers only to a limited extent. The work contributes methodological tools for evaluating long-form multimodal summarization and offers practical insights for improving numeric fidelity and content sourcing in real-world applications.

Abstract

As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Cohere. We find that GPT-3.5 and Cohere fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude can recognize important information. We also conduct a comprehensive investigation of the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4. The generated summaries and evaluation code are available at https://github.com/ChicagoHAI/characterizing-multimodal-long-form-summarization.

Paper Structure (26 sections, 1 equation, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Source distribution of extractive summary sentences. OR and SR stand for original report and shuffled report respectively, while OS and SS stand for summaries of the original report and shuffled report respectively. For the summary generated from the original report, most information comes from the beginning of the report. However, for the shuffled reports, this trend disappears for Claude but stays for GPT-4.
  • Figure 2: The summary sentence presents a rounded figure with the unit adjusted as per the remark preceding the table. The input tables are in HTML format as shown in the figure's bottom section.
  • Figure 3: Frequency of hallucinated numbers in summaries generated with the simple prompt and the CoT prompt, respectively.
  • Figure 4: Source location distribution of 2-1 synthesizing sentences.
  • Figure 5: Greedy match algorithm for similarity score calculation.
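
The Figure 2 caption describes a summary number that is a rounded version of a table entry, re-expressed in a different unit according to the remark preceding the table. A minimal sketch of how such a match could be checked is given below. It is not the paper's evaluation code: the function name `matches_after_rounding`, its parameters, and the sample values are hypothetical and chosen only for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def matches_after_rounding(table_value: str, table_unit: int,
                           summary_value: str, summary_unit: int) -> bool:
    """Check whether a summary number equals a table number once the table
    number is converted to the summary's unit and rounded to the summary's
    number of decimal places. (Hypothetical helper, not from the paper.)"""
    # Convert the table entry to an absolute amount, e.g. "in thousands" -> x1000.
    absolute = Decimal(table_value.replace(",", "")) * table_unit
    # Re-express that amount in the unit the summary uses, e.g. billions.
    in_summary_unit = absolute / summary_unit

    summary = Decimal(summary_value.replace(",", ""))
    # The exponent of the summary number gives its precision:
    # Decimal("1.2") has exponent -1, so the rounding quantum is 0.1.
    quantum = Decimal(1).scaleb(summary.as_tuple().exponent)
    rounded = in_summary_unit.quantize(quantum, rounding=ROUND_HALF_UP)
    return rounded == summary


# Illustrative values only: a table reports 1,234,567 "in thousands",
# and a summary sentence states "1.2 billion".
print(matches_after_rounding("1,234,567", 1_000, "1.2", 1_000_000_000))
```

Using `Decimal` rather than floats keeps the rounding step exact; the half-up rounding mode is an assumption, and a real checker might need to accept truncated values as well.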