BRIDO: Bringing Democratic Order to Abstractive Summarization
Junhyun Lee, Harshith Goka, Hyeonmok Ko
TL;DR
BRIDO tackles hallucination in abstractive summarization by addressing exposure bias and moving beyond reference-led evaluation. It extends BRIO with a democratic ordering scheme that ranks candidate summaries by inter-candidate ROUGE and trains on that ranking with a contrastive objective. The ranking score is $Score_{BRIDO}(S_i)=\frac{\sum_{j\neq i} R(S_i,S_j)+\alpha R(S_i,S^*)}{N-1+\alpha}$, where $R$ is the ROUGE score, $S^*$ is the reference summary, and $N$ is the number of candidates. The training loss is $\mathcal{L}=\mathcal{L}_{\text{xent}}+\gamma\mathcal{L}_{\text{ctr}}$, where $\mathcal{L}_{\text{ctr}}=\sum_i\sum_{j>i}\max(0,f(S_j)-f(S_i)+\lambda_{ij})$ and $f(S)$ is the model's score for candidate $S$. Experiments on XSum and CNN/DM show that BRIDO yields 6.25% and 3.82% improvements in G-Eval consistency over BRIO, respectively, and outperforms base models on key hallucination metrics, indicating effective mitigation of hallucination while preserving summarization quality. The approach leverages diverse beam search, inter-candidate similarity, and adjustable parameters ($\eta$, $N_g$, $N$, $\alpha$, $\lambda$, $\gamma$) to balance diversity, faithfulness, and learning signals. The results suggest practical benefits for safer abstractive summarization and point to future work on human evaluation and on extending BRIDO to decoder-only models.
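To make the ranking score concrete, the following is a minimal sketch of $Score_{BRIDO}$ in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `unigram_f1` and `brido_scores` are hypothetical, a simple unigram-overlap F1 stands in for the ROUGE function $R$, and the candidate sentences and $\alpha=1$ are made up for the example.

```python
from collections import Counter


def unigram_f1(a: str, b: str) -> float:
    """Assumed stand-in for R: unigram-overlap F1 (not a full ROUGE implementation)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    # Clipped count of tokens shared by both strings.
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def brido_scores(candidates, reference, alpha=1.0, sim=unigram_f1):
    """Score each candidate by its similarity to the other candidates,
    plus an alpha-weighted similarity to the reference, normalized by N - 1 + alpha."""
    n = len(candidates)
    scores = []
    for i, s_i in enumerate(candidates):
        inter = sum(sim(s_i, s_j) for j, s_j in enumerate(candidates) if j != i)
        scores.append((inter + alpha * sim(s_i, reference)) / (n - 1 + alpha))
    return scores


# Illustrative (made-up) candidates: the last one disagrees with the rest.
candidates = [
    "the council approved the new budget on monday",
    "the council approved the budget on monday",
    "the new budget was approved by the council",
    "the mayor resigned after the budget vote failed",
]
reference = "the council approved the new budget"

scores = brido_scores(candidates, reference, alpha=1.0)
# Candidate indices ordered from highest to lowest score.
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

In this toy example the candidate that disagrees with the majority receives the lowest score, which is the ordering the contrastive objective would then be trained to reproduce.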
Abstract
Hallucination refers to inaccurate, irrelevant, or inconsistent text generated by large language models (LLMs). While LLMs have shown great promise in a variety of tasks, hallucination remains a major challenge for many practical uses. In this paper, we tackle hallucination in abstractive text summarization by mitigating exposure bias. Existing models targeted at exposure bias mitigation, namely BRIO, aim for better summarization quality as measured by the ROUGE score. We propose a model that uses a similar exposure bias mitigation strategy but with a goal aligned with reducing hallucination. We conjecture that, among a group of candidate outputs, those with hallucinations will form a minority of the group. That is, candidates that are less similar to the others have a higher chance of containing hallucinated content. Our method exploits this property through contrastive learning, incentivizing candidates with high inter-candidate ROUGE scores. We performed experiments on the XSum and CNN/DM summarization datasets, and our method showed 6.25% and 3.82% improvement, respectively, on the G-Eval consistency score over BRIO.
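The contrastive learning step can be sketched as a pairwise margin ranking loss over candidates that have already been ordered from most to least preferred (here, by inter-candidate ROUGE). This is a minimal sketch, not the authors' training code: the function names are hypothetical, the inputs are plain floats rather than model outputs, and the rank-dependent margin $\lambda_{ij}=(j-i)\lambda$ is an assumption carried over from BRIO.

```python
def contrastive_loss(model_scores, margin=0.01):
    """Pairwise margin ranking loss.

    model_scores[i] is the model's score f(S_i) for the i-th candidate,
    with candidates sorted best-to-worst by the target ranking.
    """
    loss = 0.0
    n = len(model_scores)
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed margin schedule: grows with the rank gap between i and j.
            lam_ij = (j - i) * margin
            # Penalize a lower-ranked candidate scoring within lam_ij of a higher-ranked one.
            loss += max(0.0, model_scores[j] - model_scores[i] + lam_ij)
    return loss


def total_loss(xent, model_scores, gamma=1.0, margin=0.01):
    """Combine the cross-entropy term with the gamma-weighted contrastive term."""
    return xent + gamma * contrastive_loss(model_scores, margin)


# Model scores that agree with the target ranking incur no contrastive penalty.
agree = contrastive_loss([-0.1, -0.5, -0.9], margin=0.01)
# Model scores that invert the target ranking are penalized.
invert = contrastive_loss([-0.9, -0.5, -0.1], margin=0.01)
```

When the model's scores already respect the target ordering by at least the margin, the contrastive term is zero and only the cross-entropy term contributes to the loss.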
