TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data

Fengbin Zhu, Ziyang Liu, Fuli Feng, Chao Wang, Moxin Li, Tat-Seng Chua

TL;DR

This work targets question answering over documents containing both tables and text, a setting that requires robust discrete reasoning. It proposes a Step-wise Pipeline (Extractor, Reasoner, and Executor) to structure the reasoning, and introduces TAT-LLM, a model fine-tuned from the open-source LLaMA 2 and augmented with an External Executor that carries out arithmetic and logical operations precisely. Trained with LoRA fine-tuning on data generated automatically from FinQA, TAT-QA, and TAT-DQA, TAT-LLM outperforms the previous best fine-tuned baselines and also surpasses GPT-4 on all three benchmarks. The results indicate that specializing smaller LLMs for a task, combined with modular execution, can yield strong performance while addressing the cost and privacy concerns of real-world tabular-text QA.

Abstract

In this work, we address question answering (QA) over hybrid tabular and textual data, which is very common on the Web (e.g., SEC filings) and often requires discrete reasoning capabilities. Recently, large language models (LLMs) like GPT-4 have demonstrated strong multi-step reasoning capabilities, so we consider harnessing LLMs to solve our task. We abstract a Step-wise Pipeline for tabular and textual QA that consists of three key steps: Extractor, Reasoner, and Executor. We first design an instruction to instantiate the pipeline and validate that GPT-4 outperforms all existing methods. However, using an online LLM like GPT-4 poses challenges in terms of cost, latency, and data security risk, which motivates us to specialize smaller LLMs for this task. We develop the TAT-LLM language model by fine-tuning LLaMA 2 with training data generated automatically from existing expert-annotated datasets following the Step-wise Pipeline. Experimental results verify that TAT-LLM outperforms all baseline models, including the previous best fine-tuned models and very large-scale LLMs like GPT-4, on the FinQA, TAT-QA, and TAT-DQA benchmarks.
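
As a rough illustration of the three steps named in the abstract, the sketch below wires together an Extractor, a Reasoner, and an Executor. It is not the authors' implementation: in the paper the first two steps are produced by the language model, whereas here `extract` and `reason` are hypothetical hand-written stubs, and the table layout, key names, and example numbers are assumptions made only for this example. The one part that is meant to carry over is the idea of an external Executor: the final arithmetic is evaluated deterministically by code rather than computed by the model.

```python
import ast
import operator

# Hypothetical stand-ins for the model-driven steps. In the paper these
# outputs come from the language model, not from hand-written rules.
def extract(question, table):
    """Extractor: select the values relevant to the question (hard-coded here)."""
    return {"current": table["revenue"]["2019"],
            "previous": table["revenue"]["2018"]}

def reason(evidence):
    """Reasoner: write an arithmetic expression over the extracted values."""
    cur, prev = evidence["current"], evidence["previous"]
    return f"({cur} - {prev}) / {prev}"

# Only these binary operators are accepted by the executor.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def execute(expression):
    """Executor: evaluate the expression deterministically (+, -, *, / only)."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

def answer(question, table):
    """Run the three steps in order: extract, reason, execute."""
    evidence = extract(question, table)
    expression = reason(evidence)
    return execute(expression)

# Made-up example data, for illustration only.
table = {"revenue": {"2018": 200.0, "2019": 250.0}}
growth = answer("What was the revenue growth rate in 2019?", table)
```

Keeping the Executor outside the model means a correct expression always yields a correct number, and anything that is not plain arithmetic is rejected with an error instead of being evaluated.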

Paper Structure

This paper contains 26 sections, 5 figures, 15 tables, 1 algorithm.

Figures (5)

  • Figure 1: Examples of QA with discrete reasoning over a hybrid of tabular and textual data.
  • Figure 2: Comparison between a) End-to-end Pipeline and b) Step-wise Pipeline. c) Our TAT-LLM language model is developed by fine-tuning LLaMA 2 following the Step-wise Pipeline.
  • Figure 3: Comparison of different training strategies.
  • Figure 4: Performance comparison in terms of EM between TAT-LLM (7B) and LLaMA 2-Chat (7B) for different question types on TAT-QA.
  • Figure 5: Performance comparison in terms of EM between TAT-LLM (7B) and LLaMA 2-Chat (7B) for different question types on TAT-DQA.