Knowledge-Centric Templatic Views of Documents

Isabel Cachola, Silviu Cucerzan, Allen Herring, Vuksan Mijovic, Erik Oveson, Sujay Kumar Jauhar

TL;DR

The paper introduces a unified evaluation framework for measuring the quality of document generators across heterogeneous downstream applications. The framework adapts to a range of user-defined criteria and application scenarios, obviating the need for task-specific evaluation metrics.

Abstract

Authors seeking to communicate with broader audiences often share their ideas in various document formats, such as slide decks, newsletters, reports, and posters. Prior work on document generation has generally treated the creation of each format as a separate task, leading to fragmented learning processes, redundancy in models and methods, and disjointed evaluation. We consider each of these documents as a templatic view of the same underlying knowledge/content, and we aim to unify the generation and evaluation of these templatic views. We begin by showing that current LLMs are capable of generating various document formats with little to no supervision. Further, a simple augmentation involving a structured intermediate representation can improve performance, especially for smaller models. We then introduce a novel unified evaluation framework that can be adapted to measuring the quality of document generators for heterogeneous downstream applications. This evaluation is adaptable to a range of user-defined criteria and application scenarios, obviating the need for task-specific evaluation metrics. Finally, we conduct a human evaluation, which shows that people prefer 82% of the documents generated with our method and that human judgments correlate more highly with our unified evaluation framework than with prior metrics in the literature.
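
The two-step generation described in the abstract (input document, then a structured intermediate representation, then a templatic view) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `generate_view` name, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for this example.

```python
from typing import Callable

# Hypothetical prompt wording; the paper's actual prompts are not reproduced here.
REPRESENTATION_PROMPT = (
    "Extract the key content of the document below as a structured outline "
    "with a title, sections, and bullet points.\n\nDocument:\n{document}"
)
VIEW_PROMPT = (
    "Using the structured outline below, write a {view_format}.\n\n"
    "Outline:\n{representation}"
)


def generate_view(document: str, view_format: str, llm: Callable[[str], str]) -> str:
    """Two-step generation: document -> intermediate representation -> view.

    `llm` is an assumed interface: any callable that takes a prompt and
    returns the model's text output. The same model is used for both steps.
    """
    # Step 1: ask the model for a structured intermediate representation.
    representation = llm(REPRESENTATION_PROMPT.format(document=document))
    # Step 2: ask the same model to render that representation in the target format.
    return llm(
        VIEW_PROMPT.format(view_format=view_format, representation=representation)
    )
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; a caller would wrap whichever API they use in a function of that shape.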

Paper Structure (29 sections, 6 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Visualization of our method to unify the generation and evaluation of templatic views of documents. Given an input document, we prompt the LLM to generate an intermediate representation. We then use the representation to prompt the model to generate a templatic view of the input document. Finally, we evaluate the generations using our unified evaluation framework. Every LLM shown in the figure is the same model.
  • Figure 2: Example of the process of obtaining the rankings for the precision ordering penalty. We first use the similarity measure to map each generated panel to its most similar reference document. This mapping is used to calculate the precision quality score $Q_P$. We then use the mappings to create a one-to-one alignment from the generated to the reference panels, which we use to calculate the precision ordering penalty ($O_P$). By creating a one-to-one alignment, we are able to represent inversions in the ordering. This process is reflexive, and panels not accounted for in the precision ordering penalty are accounted for in the recall ordering penalty.
  • Figure 3: Reasons annotators preferred each document. While annotators largely preferred documents generated with an intermediate representation, the most common reasons for preference are better formatting and information content. We exclude the "Other" count as it was only selected once.
  • Figure 4: Template of the intermediate representation provided to the prompts in Table \ref{tbl:prompts}.
  • Figure 5: The above documents are example slides generated by GPT4 ($\texttt{gpt4-32k}$) with and without the intermediate representation. We can see that without the intermediate step, the model did not generate a true slide deck.
  • ...and 2 more figures
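
The alignment procedure described in the Figure 2 caption can be illustrated with a small sketch. Everything specific here is an assumption for illustration: the token-overlap `jaccard` function stands in for the paper's similarity measure, the tie-breaking rule in `one_to_one` is a simple greedy choice, and the function names are hypothetical. The exact definitions of $Q_P$ and $O_P$ are not given in this section, so the sketch stops at the mapping and the inversion count that such scores could be built from.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (an assumption): overlap of lowercase token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def map_to_most_similar(
    generated: List[str],
    reference: List[str],
    sim: Callable[[str, str], float] = jaccard,
) -> List[Tuple[int, float]]:
    """For each generated panel, return (best reference index, similarity)."""
    mapping = []
    for g in generated:
        scores = [sim(g, r) for r in reference]
        best = max(range(len(reference)), key=scores.__getitem__)
        mapping.append((best, scores[best]))
    return mapping


def one_to_one(mapping: List[Tuple[int, float]]) -> List[Tuple[int, int]]:
    """Greedy one-to-one alignment (an assumption): when several generated
    panels map to the same reference panel, keep the highest-scoring one."""
    best_for_ref: Dict[int, Tuple[int, float]] = {}
    for g_idx, (r_idx, score) in enumerate(mapping):
        if r_idx not in best_for_ref or score > best_for_ref[r_idx][1]:
            best_for_ref[r_idx] = (g_idx, score)
    # Return (generated index, reference index) pairs in generated order.
    return sorted((g, r) for r, (g, _) in best_for_ref.items())


def count_inversions(pairs: List[Tuple[int, int]]) -> int:
    """Count pairs of aligned panels whose reference order is reversed."""
    refs = [r for _, r in pairs]
    return sum(
        1
        for i in range(len(refs))
        for j in range(i + 1, len(refs))
        if refs[i] > refs[j]
    )


generated = ["intro motivation", "results table accuracy", "method model training"]
reference = ["intro and motivation", "method and model training", "results accuracy"]

mapping = map_to_most_similar(generated, reference)
pairs = one_to_one(mapping)
print(pairs)                    # [(0, 0), (1, 2), (2, 1)]
print(count_inversions(pairs))  # 1
```

In this toy example the generated document presents results before the method, so the aligned reference indices read 0, 2, 1 and contain one inversion. Running the same procedure from the reference side to the generated side would give the recall counterpart that the caption mentions.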