Impact of Model Size on Fine-tuned LLM Performance in Data-to-Text Generation: A State-of-the-Art Investigation

Joy Mahapatra; Utpal Garain

TL;DR

An investigation of fine-tuned LLMs on D2T tasks across model sizes reveals that increasing LLM size enhances readability and informativeness, but larger LLMs may sacrifice faithfulness.

Abstract

Data-to-text (D2T) generation aims to generate human-readable text from semi-structured data, such as tables and graphs. The recent success of D2T is largely attributed to advancements in LLMs. Despite this success, no research has been conducted to illustrate the impact of model size on the performance of fine-tuned LLMs for D2T tasks. D2T model performance is typically assessed on three key qualities: readability (fluency and coherence), informativeness (content similarity), and faithfulness (consistency of factual information). It is currently uncertain whether increasing the size of LLMs improves performance in D2T tasks across all three qualities. The objective of this study is to investigate the performance of fine-tuned LLMs in D2T tasks in terms of model size. Through extensive comparative analysis, we aim to elucidate both the advantages and limitations of scaling model size across five widely used D2T datasets (E2E, ViGGo, WikiTableText, DART, and WebNLG) and twelve state-of-the-art LLMs of varying sizes from five LLM families (T5, BART, OPT, BLOOM, and Llama 2). To cover all three essential qualities of D2T models, we incorporate six widely recognized automatic metrics: BLEU, METEOR, BERTScore, MoverScore, PARENT, and BARTScore. We also provide an in-depth analysis of LLM performance with respect to model size in the presence of source-reference divergence, a critical aspect of D2T tasks. Our investigation reveals that increasing LLM size enhances readability and informativeness in D2T tasks, but larger LLMs may sacrifice faithfulness. Moreover, smaller LLMs show more resilience than larger ones when source-reference divergence is present.

Paper Structure (37 sections, 6 equations, 11 figures, 4 tables)

Introduction
Research Questions and Motivations
Related Work
Preliminaries
Conditional Text Generation
(Large) Language Model
Data-to-Text (D2T) Generation
Source-Reference Divergence
Models, Datasets and Experimental Settings
Models
BART.
T5.
BLOOM.
OPT.
Llama 2.
...and 22 more sections

Figures (11)

Figure 1: Overview of data-to-text (D2T) generation with three major types: graph-to-text (left), table-to-text (middle), MR (meaning representation)-to-text (right).
Figure 2: The three key qualities used to assess the performance of a D2T model: readability (focusing on fluency and coherence), informativeness (evaluating the ability to generate essential content), and faithfulness (indicating the consistency of the generated text by measuring the presence of irrelevant facts).
Figure 3: Two prevalent types of language models based on the transformer architecture (Vaswani et al., 2017) are depicted here: bidirectional and unidirectional language models. The red lines represent attention mechanisms.
Figure 4: Two of the most popular architectures used for implementing language models: the encoder-decoder architecture (left) and the decoder-only architecture (right). The sequence $e_1e_2\dots e_m$ represents the source input of size $m$ (number of words), while the corresponding textual output is denoted by $w_1w_2\dots w_n$ of size $n$.
Figure 5: An example of source-reference divergence, taken from the WikiTableText dataset (Bao et al., 2018). The reference text includes two additional facts (bolded and underlined), 'scrolling performer' and 'fun game', that are absent in the corresponding source data.
...and 6 more figures
