Faithful Chart Summarization with ChaTS-Pi

Syrine Krichene, Francesco Piccinno, Fangyu Liu, Julian Martin Eisenschlos

TL;DR

ChaTS-Critic introduces a reference-free faithfulness metric for chart-to-summary tasks by de-rendering charts into tables and applying entailment to each sentence; ChaTS-Pi uses this metric in a pipeline to repair and re-rank candidate summaries, achieving state-of-the-art results on the Chart-To-Text and SciCap benchmarks. The approach addresses limitations of reference-based metrics and reduces hallucination by removing unsupported sentences. Experiments show strong correlations with human judgments and improved summary quality across datasets; multilingual generalization is explored with mixed results. The method relies on DePlot for de-rendering and PaLM-2 or PaLM-2 (L) with Chain-of-Thought prompting for entailment.

Abstract

Chart-to-summary generation can help explore data, communicate insights, and help visually impaired people. Multi-modal generative models have been used to produce fluent summaries, but they can suffer from factual and perceptual errors. In this work we present CHATS-CRITIC, a reference-free chart summarization metric for scoring faithfulness. CHATS-CRITIC is composed of an image-to-text model to recover the table from a chart, and a tabular entailment model applied to score the summary sentence by sentence. We find that CHATS-CRITIC evaluates the summary quality according to human ratings better than reference-based metrics, either learned or n-gram based, and can be further used to fix candidate summaries by removing unsupported sentences. We then introduce CHATS-PI, a chart-to-summary pipeline that leverages CHATS-CRITIC during inference to fix and rank sampled candidates from any chart-summarization model. We evaluate CHATS-PI and CHATS-CRITIC using human raters, establishing state-of-the-art results on two popular chart-to-summary datasets.

Paper Structure (52 sections, 1 equation, 9 figures, 10 tables)

Figures (9)

  • Figure 1: ChaTS-Pi generates multiple summaries given the chart using any summarization model. Each summary is then repaired by dropping refuted sentences according to the ChaTS-Critic sentence scoring. Finally, we rank the summaries by computing the ratio of sentences that were kept.
  • Figure 2: ChaTS-Critic is composed of a de-rendering model that extracts the table from the chart and a table entailment model. The latter can be a black-box table entailment model (e.g., TabFact, as benchmarked in \ref{tab:chats-critic-size}) or an LLM; in the latter case, we use a CoT prompt and average over $8$ samples. In the figure, the threshold to reach a binary decision is set to $T=0.75$. The chart icon refers to the same plot as in \ref{fig:example_statista}.
  • Figure 3: This example from Kantharaj et al. (2022) showcases the limits of reference-based metrics for summary evaluation: (1) the reference text often contains extra information that is not present in the chart, which skews the evaluation, and (2) reference-based metrics can fail to capture unreferenced but correct sentences. In comparison, ChaTS-Critic better reflects the human ratings for summary faithfulness.
  • Figure 4: Examples from TATA with ChaTS-Pi summaries using the demo and publicly accessible models.
  • Figure 5: PaLM 3-shot prompting for summary generation.
  • ...and 4 more figures
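
The repair-and-rank loop described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `sampler` callable is a hypothetical stand-in for the de-rendering and entailment models (it returns one vote per sampled judgment), the function names are invented for this sketch, and treating a score equal to the threshold as "supported" is an assumption. The averaging over sampled votes, the threshold $T=0.75$, and ranking by the ratio of kept sentences come from the captions.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical interface: given a de-rendered table and a sentence, return one
# entailment vote (1.0 = supported, 0.0 = refuted) per sampled judgment.
EntailmentSampler = Callable[[str, str], List[float]]


def critic_score(table: str, sentence: str, sampler: EntailmentSampler) -> float:
    """Average the sampled entailment votes for a single sentence."""
    return mean(sampler(table, sentence))


def repair(
    table: str,
    sentences: List[str],
    sampler: EntailmentSampler,
    threshold: float = 0.75,
) -> Tuple[List[str], float]:
    """Drop sentences scoring below the threshold; return kept sentences and kept ratio."""
    # Assumption: a score exactly at the threshold counts as supported.
    kept = [s for s in sentences if critic_score(table, s, sampler) >= threshold]
    ratio = len(kept) / len(sentences) if sentences else 0.0
    return kept, ratio


def rank(
    table: str,
    candidates: List[List[str]],
    sampler: EntailmentSampler,
    threshold: float = 0.75,
) -> List[Tuple[List[str], float]]:
    """Repair every candidate summary, then sort by ratio of kept sentences (highest first)."""
    repaired = [repair(table, c, sampler, threshold) for c in candidates]
    return sorted(repaired, key=lambda pair: pair[1], reverse=True)


# Toy data: fixed votes stand in for 8 sampled model judgments per sentence.
TOY_TABLE = "year | sales\n2019 | 10\n2020 | 12"
TOY_VOTES = {
    "Sales were 10 in 2019.": [1.0] * 8,
    "Sales rose to 12 in 2020.": [1.0] * 7 + [0.0],
    "Sales will double next year.": [0.0] * 7 + [1.0],
}


def toy_sampler(table: str, sentence: str) -> List[float]:
    return TOY_VOTES[sentence]


ranked = rank(
    TOY_TABLE,
    [
        ["Sales rose to 12 in 2020.", "Sales will double next year."],
        ["Sales were 10 in 2019.", "Sales rose to 12 in 2020."],
    ],
    toy_sampler,
)
```

In the toy run, the second candidate keeps both sentences and is ranked first, while the first candidate loses its unsupported forecast sentence and is ranked second with a kept ratio of 0.5.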