Evaluation of Table Representations to Answer Questions from Tables in Documents : A Case Study using 3GPP Specifications

Sujoy Roychowdhury, Sumit Soman, HG Ranjani, Avantika Sharma, Neeraj Gunda, Sai Krishna Bala

TL;DR

The paper addresses QA over technical standards documents in which tables are interspersed with text, using a 3GPP Release 18 case study. It conducts a systematic, factorial evaluation of tabular representations across multiple publicly available embeddings, focusing on row-level versus table-level chunking, repeated headers, and column separators ($2^4=16$ configurations). Key findings show that interspersed text degrades retrieval, while row-level embeddings with repeated header information substantially improve performance, with pipe separators offering marginal gains. The work provides actionable guidance for building retrieval-augmented QA systems on technical documents and introduces an SME-curated dataset to benchmark tabular representations in domain-specific contexts.

Abstract

With the ubiquitous use of document corpora for question answering, one important aspect, especially relevant for technical documents, is the ability to extract information from tables that are interspersed with text. The major challenge is that, unlike free-flowing text or an isolated set of tables, it is not obvious how to represent a table as a relevant chunk. We conduct a series of experiments examining various representations of tabular data interspersed with text to understand the relative benefits of different representations. We choose a corpus of $3^{rd}$ Generation Partnership Project (3GPP) documents since they are heavily interspersed with tables. We create an expert-curated dataset of question-answer pairs to evaluate our approach. We conclude that row-level representations, with the corresponding table header information included in every cell, improve retrieval performance, thus leveraging the structural information present in the tabular data.
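To make the representation choices concrete, here is a minimal sketch of how a table might be serialized under the three factors named above: chunk level, repeated headers, and column separator. The function names, the `Header: value` cell format, the plain-space alternative to the pipe separator, and the sample table are all assumptions made for illustration; the paper's exact serialization is not given in this summary.

```python
from itertools import product


def serialize_row(header, row, repeat_header, sep):
    """Join one row's cells, optionally prefixing each cell with its column header."""
    if repeat_header:
        cells = [f"{h}: {v}" for h, v in zip(header, row)]
    else:
        cells = list(row)
    return sep.join(cells)


def table_chunks(header, rows, level, repeat_header, sep):
    """Return the text chunks to embed for one table.

    level="row"   -> one chunk per data row
    level="table" -> a single chunk holding the whole table
    """
    lines = [serialize_row(header, r, repeat_header, sep) for r in rows]
    if level == "row":
        return lines
    if not repeat_header:
        # Assumption: without repeated headers, the header appears once at the top.
        lines = [sep.join(header)] + lines
    return ["\n".join(lines)]


# Hypothetical table, not taken from any 3GPP specification.
header = ["Parameter", "Value", "Unit"]
rows = [["timerA", "10", "ms"], ["timerB", "20", "ms"]]

# Enumerate the three binary factors named in the summary.
configs = list(product(["row", "table"], [False, True], [" ", " | "]))

best = table_chunks(header, rows, level="row", repeat_header=True, sep=" | ")
```

With these inputs, `best[0]` is the string `Parameter: timerA | Value: 10 | Unit: ms`, so each row chunk carries its own column context and can be embedded independently of the rest of the table.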

Paper Structure (7 sections, 4 figures, 2 tables)

This paper contains 7 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A sample table and its representations used for our experiments.
  • Figure 2: Histogram of the number of rows (excluding the header) per table across the corpus.
  • Figure 3: Comparison of top-5 retrieval accuracy (%) for different table representations. The caption given on the bottom left subplot is common for all subplots.
  • Figure 4: Performance for the best performing representation (row level chunk, repeated header, pipe separator) with and without interspersed text.
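
The top-5 retrieval accuracy reported in Figures 3 and 4 can be read as the percentage of questions whose relevant chunk appears among the five highest-ranked retrieved chunks. The sketch below assumes exactly one gold chunk per question and uses hypothetical chunk identifiers; the paper's precise metric definition is not stated in this summary.

```python
def top_k_accuracy(ranked_chunk_ids, gold_chunk_ids, k=5):
    """Percentage of questions whose gold chunk is among the top-k retrieved chunks.

    ranked_chunk_ids: one list per question, ordered from most to least similar.
    gold_chunk_ids:   one gold chunk id per question (assumption: a single gold chunk).
    """
    if not gold_chunk_ids:
        raise ValueError("at least one question is required")
    hits = sum(
        1
        for ranked, gold in zip(ranked_chunk_ids, gold_chunk_ids)
        if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_chunk_ids)


# Hypothetical retrieval results for two questions.
ranked = [["c3", "c1", "c7"], ["c2", "c9", "c4"]]
gold = ["c1", "c5"]

accuracy_at_5 = top_k_accuracy(ranked, gold, k=5)
accuracy_at_1 = top_k_accuracy(ranked, gold, k=1)
```

Here the first question is a hit at k=5 but not at k=1, and the second question is a miss at both, so the two values differ.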