Is This a Bad Table? A Closer Look at the Evaluation of Table Generation from Text

Pritika Ramu; Aparna Garimella; Sambaran Bandyopadhyay

TL;DR

This work proposes TabEval, a novel table evaluation strategy that captures table semantics by first breaking a table down into a list of natural language atomic statements and then comparing them with ground truth statements using entailment-based measures.

Abstract

Understanding whether a generated table is of good quality is important for using it to create or edit documents with automatic methods. In this work, we underline that existing measures for table quality evaluation fail to capture the overall semantics of tables, and sometimes unfairly penalize good tables and reward bad ones. We propose TabEval, a novel table evaluation strategy that captures table semantics by first breaking a table down into a list of natural language atomic statements and then comparing them with ground truth statements using entailment-based measures. To validate our approach, we curate a dataset comprising text descriptions for 1,250 diverse Wikipedia tables, covering a range of topics and structures, in contrast to the limited scope of existing datasets. We compare TabEval with existing metrics using unsupervised and supervised text-to-table generation methods, demonstrating its stronger correlation with human judgments of table quality across four datasets.
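
The entailment-based comparison described above can be illustrated with a minimal Python sketch. It assumes that both tables have already been unrolled into lists of atomic statements, and that precision and recall are averages of the best entailment score for each statement. That aggregation is an assumption made for illustration, not necessarily the paper's exact formulation. The names `entailment_prf` and `exact_match_entail` are hypothetical, and the toy exact-match scorer merely stands in for a real NLI model.

```python
from typing import Callable, Sequence, Tuple

# (premise, hypothesis) -> entailment score in [0, 1]
EntailFn = Callable[[str, str], float]


def exact_match_entail(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model (assumption): 1.0 on a normalized exact match."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if normalize(premise) == normalize(hypothesis) else 0.0


def entailment_prf(
    predicted: Sequence[str],
    reference: Sequence[str],
    entail: EntailFn = exact_match_entail,
) -> Tuple[float, float, float]:
    """Score predicted statements against reference statements.

    Precision: how well each predicted statement is supported by the reference.
    Recall: how well each reference statement is covered by the prediction.
    """
    if not predicted or not reference:
        return 0.0, 0.0, 0.0
    precision = sum(
        max(entail(ref, pred) for ref in reference) for pred in predicted
    ) / len(predicted)
    recall = sum(
        max(entail(pred, ref) for pred in predicted) for ref in reference
    ) / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    reference = ["Alice scored 10 points.", "Bob scored 7 points."]
    predicted = ["alice scored 10 points.", "Bob scored 9 points."]
    print(entailment_prf(predicted, reference))
```

Because statements are compared by meaning rather than by cell position, a scorer of this shape is insensitive to column order or header wording once a real NLI model replaces the exact-match stub.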

Paper Structure (16 sections, 2 equations, 8 figures, 3 tables)

Introduction
Proposed Evaluation Strategy
Table Unrolling.
Entailment-based Scoring.
Dataset Curation
Experiments
Results & Discussion
Limitations
TabUnroll Prompt Template
DescToTTo Samples
Sample 1
Sample 2
Sample 3
Text-to-Table Prompt
Human Survey
...and 1 more section

Figures (8)

Figure 1: Tables are unrolled using TabUnroll prompting with an LLM, and the obtained statements are evaluated using NLI.
Figure 2: Sample generated tables with precision (P), recall (R), and F1 computed using TabEval with GPT-4 and using a BERTScore-based metric (BS). BS penalizes tables for variation in column headers. Table A, despite having correct details, scores lower with BS but high with TabEval. Table B, which contains errors, is appropriately penalized by TabEval. Table C, which covers all the details from the reference table, receives lower precision and recall with BS but high scores with TabEval. Table D, which is missing some rows, has reduced recall with TabEval.
Figure 3: Screenshot of file given to raters for evaluation.
Figure 4: Screenshot of Microsoft Forms used for survey.
Figure 5: Screenshot of the annotation for atomicity and meaningfulness.
...and 3 more figures
