Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization
Jing Ma
TL;DR
The study investigates whether LLM-generated overviews weight multiple input documents equally or exhibit a primacy bias. Using abortion-related article triplets arranged in six input orders and evaluating summaries with ROUGE-L, BERTScore, and SummaC, the authors find a consistent primacy effect in semantic alignment: summaries align most closely with the first-seen article, while lexical and factual metrics show weaker or non-significant effects. The results imply that AI Overviews and agentic AI pipelines may overweight early sources, potentially skewing user perception and downstream actions. The work highlights the importance of addressing input-order bias for more balanced, transparent, and trustworthy LLM-based information systems.
Abstract
Large language models (LLMs) are now used in settings such as Google's AI Overviews, where they summarize multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into six input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). One-way ANOVA reveals a significant primacy effect for BERTScore across all stances, indicating that summaries are more semantically aligned with the first-seen article. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. These findings present risks for applications that rely on LLM-generated overviews and for agentic AI systems, where steps involving LLMs can disproportionately influence downstream actions.
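As an illustration of the design described above, the following is a minimal sketch of how a triplet could be permuted into its six input orders and how a one-way ANOVA F statistic could be computed over scores grouped by input position. It is not the authors' code: the names `STANCES`, `input_orders`, and `one_way_anova_f` are hypothetical, the article texts are placeholders, and the F statistic is computed by hand with the standard library rather than with whatever statistics package the study used.

```python
from itertools import permutations

# Assumed stance labels for one triplet (hypothetical naming).
STANCES = ("pro", "neutral", "con")


def input_orders(triplet):
    """Return all 3! = 6 orderings of a triplet given as {stance: text}.

    Each ordering is a tuple of (stance, text) pairs; the index in the
    tuple corresponds to the input position (0 = first-seen article).
    """
    return [
        tuple((stance, triplet[stance]) for stance in order)
        for order in permutations(STANCES)
    ]


def one_way_anova_f(groups):
    """Compute the one-way ANOVA F statistic for a list of score lists.

    groups[i] holds the metric scores (e.g. BERTScore) observed when the
    source article sat at input position i + 1.
    """
    k = len(groups)                          # number of positions
    n = sum(len(g) for g in groups)          # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation explained by position (between groups).
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Residual variation inside each position (within groups).
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Placeholder texts stand in for real articles.
orders = input_orders({"pro": "A", "neutral": "B", "con": "C"})
first_stances = [order[0][0] for order in orders]
```

Across the six orders, each stance appears in the first position exactly twice, which is what allows position effects to be separated from stance effects. A significant F statistic would then be followed by pairwise comparisons between positions, as the abstract describes.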
