Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length

Jingxuan Chen; Mohammad Taher Pilehvar; Jose Camacho-Collados

Abstract

Users often rely on Large Language Models (LLMs) for processing multiple documents or performing analysis over a number of instances. For example, analysing the overall sentiment of a number of movie reviews requires an LLM to process the sentiment of each review individually in order to provide a final aggregated answer. While LLM performance on such individual tasks is generally high, there has been little research on how LLMs perform when dealing with multi-instance inputs. In this paper, we perform a comprehensive evaluation of the multi-instance processing (MIP) ability of LLMs for tasks in which they excel individually. The results show that all LLMs follow a pattern of slight performance degradation for small numbers of instances (approximately 20-100), followed by a performance collapse on larger instance counts. Crucially, our analysis shows that while context length is associated with this degradation, the number of instances has a stronger effect on the final results. This finding suggests that when optimising LLM performance for MIP, attention should be paid to both context length and, in particular, instance count.
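The contrast between handling each review on its own and answering one aggregated question over many reviews can be illustrated with a minimal sketch. The function names, the prompt wording, and the choice to score an unparseable answer as a failure are all assumptions made for illustration; they are not the prompts or the scoring procedure used in the paper.

```python
from typing import List


def build_single_instance_prompts(reviews: List[str]) -> List[str]:
    """One prompt per review: the model labels each instance separately.

    Hypothetical wording, not taken from the paper.
    """
    return [
        f"Review: {review}\nIs the sentiment positive or negative? "
        "Answer with one word."
        for review in reviews
    ]


def build_multi_instance_prompt(reviews: List[str]) -> str:
    """One prompt for all reviews: the model must return an aggregate count.

    Hypothetical wording, not taken from the paper.
    """
    numbered = "\n".join(f"{i}. {review}" for i, review in enumerate(reviews, start=1))
    return (
        f"Below are {len(reviews)} movie reviews.\n"
        f"{numbered}\n"
        "How many of them are positive? Answer with a single integer."
    )


def multi_instance_success(predicted: str, gold_labels: List[str]) -> bool:
    """True if the predicted count equals the number of positive gold labels.

    Assumption: an answer that cannot be parsed as an integer is a failure.
    """
    gold_count = sum(label == "positive" for label in gold_labels)
    try:
        return int(predicted.strip()) == gold_count
    except ValueError:
        return False


reviews = ["Great film", "Dull plot", "Loved it"]
gold_labels = ["positive", "negative", "positive"]
single_prompts = build_single_instance_prompts(reviews)
multi_prompt = build_multi_instance_prompt(reviews)
```

In this sketch the instance count is `len(reviews)` and the context length grows with the size of `multi_prompt`, so the two factors named in the title can be varied separately by changing how many reviews are included versus how long each review is.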

Paper Structure (108 sections, 5 equations, 25 figures, 6 tables)

Introduction
Related Work
Long Context.
Batch Processing.
Multi-Instance Processing
Formulation
Filtering for Controlled Difficulty
Evaluation Metrics
Experimental Setting
Individual Tasks
Models and Prompting
Single-Instance Filtering
MIP Sampling
RQ1: Performance and Failure Behaviours
Performance Analysis
...and 93 more sections

Figures (25)

Figure 1: A toy example of SIP and MIP settings for sentiment analysis, where an LLM succeeds under SIP but fails under MIP given the same instances.
Figure 2: Model success rates (averaged across all tasks) as a function of the number of instances. Error bars indicate standard deviation across five random seeds. LLMs from the same company share the same colour family, while markers denote categories: $\bullet$ (open-weight, $\geq$37B active parameters), $\blacksquare$ (open-weight, $\leq$22B active parameters), $\blacktriangle$ (frontier closed-source), and $\blacklozenge$ (lightweight closed-source).
Figure 3: Task success rates (averaged across all LLMs) as a function of the number of instances. Error bars indicate standard deviation across five random seeds.
Figure 4: Success rates as a function of the number of instances for the original instance order and two shuffled variants constructed from the same instance sets.
Figure 5: Breakdown of failure types. Key mistakes, aggregation mistakes, individual mistakes, and combined mistakes (Agg.+Indi.) are categorised as wrong answers (blue), while parsing errors and overlong input errors are categorised as invalid outputs (orange).
...and 20 more figures
