Improved LLM Agents for Financial Document Question Answering

Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe

TL;DR

This paper investigates numerical question answering over financial documents that combine tables and text, challenging the idea that a critic agent alone (without oracle labels) reliably improves results. Building on a prior multi-agent framework, it introduces an improved critic agent and a calculator agent to enhance accuracy and safety without fine-tuning. Experiments on two financial QA datasets (TATQA and FinQA) across two LLMs (llama3-70B and GPT4-turbo) show that the calculator agent often outperforms the previous state-of-the-art PoT approach, while the traditional critic without oracle has limited or mixed benefits. The study also analyzes inter-agent interactions, highlighting the calculator’s pivotal role in achieving robust numerical reasoning in financial contexts, with results sensitive to the LLM used. These findings suggest a safer, more scalable path for intrinsic numerical reasoning in LLMs applied to finance.

Abstract

Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that this critic agent's performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with a calculator agent, which outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.

Paper Structure

This paper contains 31 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Visualization of the CoT and coder approaches via the analyst agent. Flow: The user feeds the table, text, and question to the analyst agent, which processes the input (using CoT or PoT) and returns the final answer.
  • Figure 2: Visualization of the critic and analyst agents. Flow: The user feeds the table, text, and question to the analyst agent, which processes the input and produces a CoT answer that is passed to the critic agent. The critic agent produces a critique and passes it to the analyst agent, which processes the CoT answer and the critique before returning the final answer.
  • Figure 3: Visualization of the calculator and analyst agents. Flow: The user feeds the table, text, and question to the analyst agent, which processes the input and produces a CoT answer that is passed to the calculator agent. The calculator agent produces an answer and passes it to the analyst agent, which processes the CoT answer and the calculator agent's answer before returning the final answer.
  • Figure 4: Visualization of the critic, calculator, and analyst agents. Flow: The user feeds the table, text, and question to the analyst agent, which processes the input and produces a CoT answer that is passed to the critic agent. The critic agent and analyst agent interact (exactly as in Figure 2) to produce a refined answer. The refined answer is then sent to the calculator agent, and the calculator agent and analyst agent interact (exactly as in Figure 3) to produce a more precise answer, which is the final answer sent to the user.
  • Figure 5: Analysis of the changes in the correctness of answers by the critic agent. Pie charts on the left are for TATQA and pie charts on the right are for FinQA.
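
The message flow described for Figure 4 (analyst, then critic, then analyst, then calculator, then analyst) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt wording, and all function names are hypothetical, and the calculator here is assumed to evaluate a single arithmetic expression by walking its syntax tree rather than executing generated code, which is one plausible reading of why such an agent could be safer than program-of-thought.

```python
import ast
import operator
from typing import Callable, Optional

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]

# Only plain arithmetic is allowed; names, calls, and attributes are rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate an arithmetic expression by walking its AST (no eval/exec)."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))


def analyst(llm: LLM, context: str, question: str, feedback: str = "") -> str:
    """Analyst agent: answers the question, optionally revising with feedback."""
    prompt = f"ANALYST\nContext:\n{context}\nQuestion: {question}\n"
    if feedback:
        prompt += f"Feedback:\n{feedback}\n"
    return llm(prompt)


def critic(llm: LLM, context: str, question: str, answer: str) -> str:
    """Critic agent: returns a critique of the analyst's answer."""
    return llm(f"CRITIC\nContext:\n{context}\nQuestion: {question}\nAnswer:\n{answer}\n")


def calculator(llm: LLM, answer: str) -> Optional[float]:
    """Calculator agent (assumed design): extract one expression, then evaluate it."""
    expr = llm(f"CALCULATOR\nWrite the arithmetic expression used in:\n{answer}\n").strip()
    try:
        return safe_eval(expr)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return None  # no usable expression; the analyst keeps its own answer


def run_pipeline(llm: LLM, context: str, question: str) -> str:
    """Analyst -> critic -> analyst -> calculator -> analyst."""
    draft = analyst(llm, context, question)
    critique = critic(llm, context, question, draft)
    refined = analyst(
        llm, context, question,
        feedback=f"Previous answer: {draft}\nCritique: {critique}",
    )
    value = calculator(llm, refined)
    return analyst(
        llm, context, question,
        feedback=f"Previous answer: {refined}\nCalculator result: {value}",
    )
```

To try the sketch, pass any prompt-to-text function as `llm`, for example a thin wrapper around a hosted model. Dropping the critic step or the calculator step from `run_pipeline` gives the two-agent flows of Figures 2 and 3 respectively.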