Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization

Jing Ma

TL;DR

The study investigates whether LLM-generated overviews weight multiple input documents equally or exhibit a primacy bias. Using abortion-related article triplets arranged in six input orders and evaluating summaries with ROUGE-L, BERTScore, and SummaC, the authors find a consistent primacy effect in semantic alignment driven by the first-seen article, though lexical and factual metrics show weaker or non-significant effects. The results imply that AI Overviews and agentic AI pipelines may overweight early sources, potentially skewing user perception and downstream actions. The work highlights the importance of addressing input-order bias for more balanced, transparent, and trustworthy LLM-based information systems.

Abstract

Large language models (LLMs) are now used in settings such as Google's AI Overviews, where they summarize multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into six input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). One-way ANOVA reveals a significant primacy effect for BERTScore across all stances, indicating that summaries are more semantically aligned with the first-seen article. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. The findings present risks for applications that rely on LLM-generated overviews and for agentic AI systems, where the steps involving LLMs can disproportionately influence downstream actions.
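
The permutation step in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the names `input_orders` and `position_of` and the placeholder article texts are hypothetical, and it only shows how one triplet yields 3! = 6 input orders and how each article's position within an order can be recorded for later comparison.

```python
from itertools import permutations

STANCES = ("PRO", "NEUTRAL", "CON")

def input_orders(triplet):
    """Return all six orderings of a triplet as lists of (stance, text) pairs.

    `triplet` maps each stance label to its article text (hypothetical layout).
    """
    return [[(s, triplet[s]) for s in order] for order in permutations(STANCES)]

def position_of(order, stance):
    """Return the 1-based position of `stance` within one ordering."""
    return [s for s, _ in order].index(stance) + 1

# Placeholder texts, not real articles.
triplet = {
    "PRO": "pro article text",
    "NEUTRAL": "neutral article text",
    "CON": "con article text",
}
orders = input_orders(triplet)
for order in orders:
    print(" -> ".join(s for s, _ in order))
```

Across the six orders, each stance occupies each position exactly twice, so position and stance are balanced within a triplet. With 40 triplets, this would give 240 prompts if each order is run once.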

Paper Structure

This paper contains 9 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the experimental pipeline. Triplets of PRO/NEUTRAL/CON news articles on abortion were collected and annotated, permuted into six input orders, and summarized by Gemini 2.5 Flash. Resulting summaries were evaluated using ROUGE-L, BERTScore, and SummaC, and statistical differences across positions were tested with one-way ANOVA to detect primacy effects.
  • Figure 2: Distribution of six major U.S. news sources across PRO, NEUTRAL, and CON abortion stances. Each bar shows the percentage contribution of each source within a stance category.
  • Figure 3: Effects of input order on similarity scores across ROUGE-L, BERTScore, and SummaC. Each heatmap shows the mean metric value for a given stance (CON, NEUTRAL, PRO) under six possible input sequences.
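
The position-wise test named in Figure 1 can be sketched as follows. This is an illustrative, standard-library computation of the one-way ANOVA F statistic, not the authors' analysis script. The function name `one_way_anova_f` is hypothetical, and the scores are toy numbers chosen for easy arithmetic, not results from the paper. Each group would hold one metric's scores (for example BERTScore) for articles shown at Position 1, 2, or 3.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of score groups.

    F = (between-group sum of squares / (k - 1))
        / (within-group sum of squares / (N - k))
    Assumes at least two groups and non-zero within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Toy scores grouped by input position (illustrative only, not paper data).
scores_by_position = {
    1: [5.0, 6.0, 7.0],
    2: [2.0, 3.0, 4.0],
    3: [1.0, 2.0, 3.0],
}
f_stat = one_way_anova_f(list(scores_by_position.values()))
print(f"F = {f_stat:.2f}")
```

A large F value indicates that mean scores differ across positions more than within-position variation would suggest. Obtaining a p-value and the pairwise comparisons would require the F distribution and a post-hoc test, which this sketch leaves out.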