Improved LLM Agents for Financial Document Question Answering
Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe
TL;DR
This paper investigates numerical question answering over financial documents that combine tables and text, and challenges the idea that a critic agent alone (without oracle labels) reliably improves results. Building on a prior multi-agent framework, it introduces an improved critic agent and a calculator agent to enhance accuracy and safety without fine-tuning. Experiments on two financial QA datasets (TATQA and FinQA) with two LLMs (llama3-70B and GPT4-turbo) show that the calculator agent often outperforms the previous state-of-the-art program-of-thought (PoT) approach, while the traditional critic without oracle labels yields limited or mixed benefits. The study also analyzes inter-agent interactions, highlighting the calculator’s pivotal role in achieving robust numerical reasoning in financial contexts, with results sensitive to the LLM used. These findings suggest a safer path to numerical reasoning with LLMs in finance that requires no fine-tuning.
Abstract
Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that this critic agent's performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with a calculator agent, which outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.
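The abstract does not describe how the calculator agent is implemented, so the following is only an illustrative sketch of one way a calculator-style tool can be safer than program-of-thought: instead of executing model-generated code, it evaluates a plain arithmetic expression against a whitelist of operators. The function name `safe_calculate`, the operator whitelist, and the sample expression are all assumptions made for illustration and are not taken from the paper.

```python
import ast
import operator

# Hypothetical whitelist: only basic arithmetic is allowed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node):
    """Recursively evaluate a parsed expression, rejecting anything non-arithmetic."""
    if isinstance(node, ast.Constant):
        # Accept numeric literals only (bool is a subclass of int, so exclude it).
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_calculate(expression: str):
    """Evaluate an arithmetic expression string without calling eval() or exec()."""
    tree = ast.parse(expression, mode="eval")
    return _evaluate(tree.body)


if __name__ == "__main__":
    # Made-up example: relative change from 120 to 150.
    print(safe_calculate("(150 - 120) / 120"))
```

In this sketch, function calls, attribute access, and names all fall through to the `ValueError` branch, so an input such as `__import__('os').system('ls')` is rejected rather than run. A generated program executed directly would offer no such guarantee.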
