
TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools

Avi Caciularu, Alon Jacovi, Eyal Ben-David, Sasha Goldshtein, Tal Schuster, Jonathan Herzig, Gal Elidan, Amir Globerson

TL;DR

TACT addresses the difficulty LLMs have in aggregating information across texts to produce numerical answers. It introduces a benchmark built atop InstructIE, adding expert-crafted instructions, Pandas-style commands, and gold answers to enable end-to-end text-to-table reasoning. The authors propose IE as a Tool, a pipeline that first constructs a table, then formulates a Pandas query over it, and finally produces the answer, achieving improvements of up to 12% over standard prompting, especially for larger models. The work delivers a focused dataset, a decomposition-based modeling approach, and practical insights for improving LLM reasoning in scenarios that require integrating information across documents.

Abstract

Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts. To better evaluate this setting and facilitate modeling efforts, we introduce TACT (Text And Calculations through Tables), a dataset crafted to evaluate LLMs' reasoning and computational abilities using complex instructions. TACT contains challenging instructions that demand stitching together information scattered across one or more texts and performing complex integration on this information to generate the answer. We construct this dataset by leveraging an existing dataset of texts and their associated tables. For each such table, we formulate new queries and gather their respective answers. We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%. To pinpoint the difficulties and thoroughly dissect the problem, we analyze model performance across three components: table generation, Pandas command generation, and execution. Unexpectedly, we discover that each component presents substantial challenges for current LLMs. These insights lead us to propose a focused modeling framework, which we refer to as IE as a tool. Specifically, we propose to add "tools" for each of the above steps, and implement each such tool with few-shot prompting. This approach shows an improvement over existing prompting techniques, offering a promising direction for enhancing model capabilities in these tasks.
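
The three components named in the abstract (table generation, Pandas command generation, and execution) can be illustrated with a minimal sketch. The text, instruction, table, and command below are invented for illustration and are not taken from TACT; the first two steps are written by hand here, whereas in the paper each is produced by a few-shot-prompted LLM tool. The helper name `run_command` is likewise hypothetical.

```python
import pandas as pd

# Hypothetical input, invented for illustration (not a TACT example).
text = (
    "The Hawks beat the Owls 3-1 on Monday. "
    "On Friday the Hawks lost 0-2 to the Bears. "
    "A week later the Hawks won 4-2 against the Lions."
)
instruction = "How many goals did the Hawks score in the games they won?"

# Step 1, table generation: an IE tool would extract this table from the
# text and the instruction. Here it is written by hand.
table = pd.DataFrame(
    {
        "opponent": ["Owls", "Bears", "Lions"],
        "goals_for": [3, 0, 4],
        "goals_against": [1, 2, 2],
    }
)

# Step 2, Pandas command generation: a second tool would formulate this
# command from the table, the text, and the instruction. Also hand-written.
command = "df[df['goals_for'] > df['goals_against']]['goals_for'].sum()"


def run_command(df: pd.DataFrame, command: str):
    """Step 3, execution: evaluate the command with the table bound to `df`."""
    # Assumption: the command is a single expression over `df`. Commands
    # generated by a model should be run in a sandbox, not with a bare eval.
    return eval(command, {"pd": pd, "df": df})


answer = int(run_command(table, command))
print(answer)  # 7 (3 goals against the Owls + 4 against the Lions)
```

Keeping the table and the command as separate intermediate artifacts is what allows each stage to be inspected and scored on its own, which is the kind of per-component analysis the abstract describes.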

Paper Structure (25 sections, 11 figures, 4 tables)


Figures (11)

  • Figure 1: Annotated components of the TACT dataset. The answer is concise but demands advanced reasoning. Intermediate artifacts aid in analyzing LLM reasoning and designing the IE as a tool method. Relevant spans are underlined.
  • Figure 2: Distribution of the different Pandas tokens in the TACT dataset.
  • Figure 3: Length of TACT's Pandas commands vs. the total number of cells in their corresponding tables.
  • Figure 4: Possible setups for solving TACT with LLMs. (a) Typical Approach: the large language model (LLM) directly generates the answer based on the provided query and text, but without the aid of any external tools. (b) With IE as a Tool: This approach utilizes a three-step process. First, an information extraction tool generates a structured table from the text and query. Next, another tool formulates an appropriate Pandas command based on this table, the text and the query. Finally, all this information is fed into the LLM, which then generates the answer; or into a code interpreter that can run the Pandas command over the table.
  • Figure 5: The data creation guidelines for the TACT Dataset. This figure presents a comprehensive set of guidelines designed to assist annotators in extracting and computing numerical aspects from provided text and table data using the Pandas library. The guidelines include steps for reviewing and assessing the relevance of data, identifying numerical aspects, formulating natural language instructions and queries, translating these into precise Pandas commands, and validating the results. An example is provided to demonstrate the process, from text and table review to the execution and verification of the computed result.
  • ...and 6 more figures