MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application

Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Vincent Jim Zhang, Yuqing Guo, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Zhiyang Deng, Heming Fu, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Guojun Xiong, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie

TL;DR

This work introduces MultiFinBen, the first multilingual and multimodal benchmark for financial LLMs, addressing the gap between existing evaluations and real-world finance, which requires cross-language and cross-modal reasoning. It couples a difficulty-aware evaluation framework with two novel task families, PolyFiQA for multilingual financial QA and Financial OCR for end-to-end document understanding, alongside integrated text, vision, and audio datasets. Evaluating 21 models shows substantial limitations: even frontier systems like GPT-4o achieve only 46.01% overall, with multilingual performance plummeting to single-digit levels on cross-language tasks, underscoring the need for dedicated multilingual and multimodal capabilities. All datasets, evaluation scripts, and leaderboards are openly released to support transparent and reproducible assessment of financial AI systems. The benchmark thereby enables targeted improvements toward realistic expert-level financial reasoning across languages and modalities, with attention to broader societal implications and safeguards.

Abstract

Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.

Paper Structure

This paper contains 80 sections, 4 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of MultiFinBen.
  • Figure 2: Representation examples of PolyFiQA-Easy, EnglishOCR, and GreekOCR.
  • Figure 3: Performance across modalities: Text, Vision, Audio.
  • Figure 4: Performance across languages: EN, ZH, JA, ES, EL, BI, MU.
  • Figure 5: Performance across difficulty levels: Easy, Medium, Hard.
  • ...and 4 more figures