Facts-and-Feelings: Capturing both Objectivity and Subjectivity in Table-to-Text Generation
Tathagata Dey, Pushpak Bhattacharyya
TL;DR
The paper addresses the challenge of generating natural language from tables while preserving subjectivity. It introduces the Ta2TS dataset of 3849 instances across finance, weather, and sports, and formalizes the task as mapping a table $T$ to a text $S$ by modeling $P(S|T;\theta)$ with autoregressive decoding, $s_i=\arg\max P(s_i|T,s_1,\dots,s_{i-1};\theta)$. It compares T5 sequence-to-sequence models fine-tuned on linearized tables against prompted large language models (GPT-3.5-turbo, Mistral, Llama-2), using both automatic metrics (BLEU-4, METEOR, ROUGE-L, BERTScore) and human evaluations of coherence, coverage, accuracy, and subjectivity capture. Key findings show that context-rich T5 models can approach GPT-3.5-turbo performance, while LLM prompting offers strong coverage and subjectivity control, with 3-shot prompts often performing best; Mistral-7B and Llama-2 underperform on this task. The work provides the first comprehensive, multi-genre benchmark for subjectivity-infused table-to-text generation and offers a baseline for future methods that combine table encoders with generation components and expanded datasets.
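As a short worked form of the formulation mentioned above, the following assumes the standard autoregressive factorization of the sequence probability. The output length $n$ and the hat notation for decoded tokens are introduced here for illustration, and the conditioning table is written as $T$:

$$P(S \mid T;\theta) = \prod_{i=1}^{n} P(s_i \mid T, s_1, \dots, s_{i-1};\theta), \qquad \hat{s}_i = \arg\max_{s} P(s \mid T, \hat{s}_1, \dots, \hat{s}_{i-1};\theta)$$

The second expression is greedy decoding, which picks the most probable token at each step. Whether the models are decoded greedily, with beam search, or by sampling is not stated in this summary.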
Abstract
Table-to-text generation, a long-standing challenge in natural language generation, has remained unexplored through the lens of subjectivity. Subjectivity here encompasses the comprehension of information derived from the table that cannot be described solely by objective data. Given the absence of pre-existing datasets, we introduce the Ta2TS dataset with 3849 data instances. We fine-tune sequence-to-sequence models on the linearized tables and prompt popular large language models. We analyze the results from quantitative and qualitative perspectives to verify that the generated text captures subjectivity and remains factually consistent. The analysis shows that the fine-tuned LMs can perform close to the prompted LLMs. Both kinds of models capture the tabular data, generating text with a BERTScore of 85.15% and a METEOR score of 26.28%. To the best of our knowledge, we provide the first-of-its-kind dataset on tables with multiple genres and subjectivity included, and present the first comprehensive analysis and comparison of the performance of different LLMs on this task.
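To make "linearized tables" concrete, here is a minimal sketch of one way to flatten a table into a single string that a sequence-to-sequence model such as T5 could take as input. The function name `linearize_table`, the `title:`/`row:` markers, the `|` and `||` separators, and the sample weather table are all assumptions made for illustration; the exact linearization scheme and data used for Ta2TS are not described in this section.

```python
from typing import Sequence


def linearize_table(
    title: str,
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
) -> str:
    """Flatten a table into one string, row by row.

    Each cell is paired with its column name so the model sees
    which attribute every value belongs to. The separators are a
    hypothetical choice, not the scheme used by the paper.
    """
    parts = [f"title: {title}"]
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row length does not match header length")
        cells = " | ".join(f"{col}: {val}" for col, val in zip(header, row))
        parts.append(f"row: {cells}")
    return " || ".join(parts)


# Made-up sample table, not taken from the Ta2TS dataset.
example = linearize_table(
    "Weekly weather",
    ["day", "high_c", "rain_mm"],
    [["Mon", 31, 0], ["Tue", 24, 12]],
)
print(example)
```

The resulting string would be tokenized and passed to the model as the source sequence, with the reference description as the target. Keeping the column name next to each value is one common way to preserve the table's structure once it has been flattened.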
