Dólares or Dollars? Unraveling the Bilingual Prowess of Financial LLMs Between Spanish and English
Xiao Zhang, Ruoyu Xiang, Chenhan Yuan, Duanyu Feng, Weiguang Han, Alejandro Lopez-Lira, Xiao-Yang Liu, Sophia Ananiadou, Min Peng, Jimin Huang, Qianqian Xie
TL;DR
This work addresses the gap in Spanish financial NLP by introducing Toisón de Oro, a bilingual framework that provides FIT-ES (Spanish-English instruction-tuning data), FinMA-ES (a bilingual financial LLM), and FLARE-ES (a bilingual benchmark). The approach leverages instruction tuning on bilingual data to train FinMA-ES (based on LLaMA2-7B) and a Spanish-only variant, with FLARE-ES spanning 21 datasets across 9 tasks to assess cross-lingual performance. Experiments show FinMA-ES often outperforms baselines such as GPT-4 on Spanish financial tasks and demonstrates meaningful cross-lingual transfer, highlighting the value of bilingual data for both languages. The work releases all datasets, models, and benchmarks to enable broader research and practical bilingual financial NLP applications, underscoring the potential for more inclusive and effective multilingual finance analytics.
Abstract
Despite Spanish's pivotal role in the global finance industry, a pronounced gap exists in Spanish financial natural language processing (NLP) and application studies compared to English, especially in the era of large language models (LLMs). To bridge this gap, we unveil Toisón de Oro, the first bilingual framework that establishes instruction datasets, finetuned LLMs, and evaluation benchmark for financial LLMs in Spanish joint with English. We construct a rigorously curated bilingual instruction dataset including over 144K Spanish and English samples from 15 datasets covering 7 tasks. Harnessing this, we introduce FinMA-ES, an LLM designed for bilingual financial applications. We evaluate our model and existing LLMs using FLARE-ES, the first comprehensive bilingual evaluation benchmark with 21 datasets covering 9 tasks. The FLARE-ES benchmark results reveal a significant multilingual performance gap and bias in existing LLMs. FinMA-ES models surpass SOTA LLMs such as GPT-4 in Spanish financial tasks, due to strategic instruction tuning and leveraging data from diverse linguistic resources, highlighting the positive impact of cross-linguistic transfer. All our datasets, models, and benchmarks have been released.
