FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering

Siqiao Xue, Xiaojing Li, Fan Zhou, Qingyang Dai, Zhixuan Chu, Hongyuan Mei

TL;DR

FAMMA introduces a multilingual, multimodal benchmark for financial QA to systematically evaluate LLMs on advanced financial reasoning. It provides two main datasets, FAMMA-Basic (1,945 questions) and FAMMA-LivePro (103 expert-authored questions), plus OCR-based text-only variants and a collection of 1,273 reasoning traces to enable supervised reasoning training. Experiments reveal that even frontier models struggle on knowledge-heavy finance tasks, though Python-based reasoning and larger-scale fine-tuning yield meaningful gains, while retrieval augmentation shows limited benefits for reasoning tasks. The benchmark is open-source with a public leaderboard, enabling broader community evaluation and fostering progress in finance-domain AI systems.

Abstract

In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of large language models (LLMs) in answering complex reasoning questions that require advanced financial knowledge. The benchmark has two versions: FAMMA-Basic consists of 1,945 questions extracted from university textbooks and exams, along with human-annotated answers and rationales; FAMMA-LivePro consists of 103 novel questions created by human domain experts, with answers and rationales held out from the public for a contamination-free evaluation. These questions cover advanced knowledge of 8 major subfields in finance (e.g., corporate finance, derivatives, and portfolio management). Some are in Chinese or French, while the majority are in English. Each question includes some non-text data such as charts, diagrams, or tables. Our experiments reveal that FAMMA poses a significant challenge to LLMs, including reasoning models such as GPT-o1 and DeepSeek-R1. Additionally, we curated 1,270 reasoning trajectories of DeepSeek-R1 on the FAMMA-Basic data, and fine-tuned a series of open-source Qwen models using this reasoning data. We found that training a model on these reasoning trajectories can significantly improve its performance on FAMMA-LivePro. We released our leaderboard, data, code, and trained models at https://famma-bench.github.io/famma/.

Paper Structure (40 sections, 12 figures, 4 tables)

This paper contains 40 sections, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Questions in FAMMA, requiring highly specialized knowledge and sophisticated calculation.
  • Figure 2: Performance of evaluated models on FAMMA-Basic. Each triplet of bars shows Pass@1 accuracy for the arithmetic subset (orange), the non-arithmetic subset (green), and their weighted consolidated score (blue).
  • Figure 3: Performance of evaluated models on FAMMA-LivePro. Each bar group shows the same three metrics as in Figure 2.
  • Figure 4: Pass@1 accuracy (%) breakdown by subfields on FAMMA-Basic; the legend is identical to that in Figure 5.
  • Figure 5: Pass@1 accuracy (%) breakdown by subfields on FAMMA-LivePro.
  • ...and 7 more figures
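
The caption of Figure 2 mentions three metrics: Pass@1 accuracy on the arithmetic subset, on the non-arithmetic subset, and a weighted consolidated score. The sketch below shows one way such numbers could be computed. It assumes that the consolidated score weights each subset by its number of questions, which the caption does not state; the function names and the toy data are hypothetical and do not come from the FAMMA codebase.

```python
from typing import Sequence


def pass_at_1(results: Sequence[bool]) -> float:
    """Percentage of questions whose single sampled answer was judged correct."""
    if not results:
        raise ValueError("subset must contain at least one question")
    return 100.0 * sum(results) / len(results)


def consolidated_score(arithmetic: Sequence[bool],
                       non_arithmetic: Sequence[bool]) -> float:
    """Combine the two subset accuracies, weighting each by its question count.

    Assumption: size-proportional weights. The benchmark may use a
    different weighting scheme.
    """
    n_arith, n_other = len(arithmetic), len(non_arithmetic)
    total = n_arith + n_other
    return (n_arith * pass_at_1(arithmetic)
            + n_other * pass_at_1(non_arithmetic)) / total


# Toy correctness flags (hypothetical, not benchmark results).
arithmetic = [True, False, False, True]
non_arithmetic = [True, True, True, False, True, True]

print(f"arithmetic:     {pass_at_1(arithmetic):.1f}")
print(f"non-arithmetic: {pass_at_1(non_arithmetic):.1f}")
print(f"consolidated:   {consolidated_score(arithmetic, non_arithmetic):.1f}")
```

With size-proportional weights, the consolidated score equals the plain Pass@1 accuracy over the union of both subsets, so a model cannot raise it by doing well only on the smaller subset.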