When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework

Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, Ce Zhang

TL;DR

This work proposes a theoretical framework that decomposes the failure modes of long context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise).

Abstract

We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that decomposes the failure modes of long context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi-agent chunking, i.e., dividing a lengthy sequence into smaller chunks and aggregating the processed results of each chunk. Our experiments on tasks such as retrieval, question answering, and summarization confirm both the theoretical analysis and the conditions that favor multi-agent chunking. By exploring the accelerated decay of model fidelity with input length, we also explain why, for large inputs, a weaker model configured with chunk-based processing can surpass a more advanced model like GPT-4o applied in a single shot. Overall, we present a principled framework for understanding long-context failures, and our results highlight a direct pathway to handling long contexts in LLMs with carefully managed chunking and aggregator strategies.

Paper Structure

This paper contains 60 sections, 1 theorem, 19 equations, 4 figures, 9 tables.

Key Result

Proposition 3.1

Let $\mathcal{L}_{\mathrm{strong}}(T)$ be the loss of a single strong model and $\mathcal{L}_{\mathrm{D\&C}}(T)$ be the loss of a divide-and-conquer system on input length $T$. Assume: Then, the D&C loss accumulates linearly ($\mathcal{L}_{\mathrm{D\&C}}(T) = O(T)$), and there exists a critical threshold $T_0$ such that for all $T > T_0$, the D&C system strictly outperforms the single strong model.
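
To make the threshold $T_0$ concrete, the sketch below works through one hypothetical instance. The functional forms are assumptions chosen for illustration only and are not the proposition's stated conditions: the strong model's loss is taken to grow quadratically, $a T^2$ (one simple way to model the "accelerated decay" mentioned in the abstract), while the D&C loss is taken to be linear, $b T + c$, with $c$ standing in for a fixed aggregation cost. Under those assumptions the crossover is the positive root of $a T^2 = b T + c$.

```python
import math


def strong_loss(T: float, a: float) -> float:
    """Hypothetical single-model loss: grows quadratically with input length."""
    return a * T ** 2


def dc_loss(T: float, b: float, c: float) -> float:
    """Hypothetical D&C loss: linear in input length plus a fixed aggregator cost."""
    return b * T + c


def critical_length(a: float, b: float, c: float) -> float:
    """Positive root of a*T^2 = b*T + c; beyond it, D&C has the lower loss.

    Assumes a > 0, b >= 0, c >= 0 (illustrative coefficients, not from the paper).
    """
    return (b + math.sqrt(b * b + 4 * a * c)) / (2 * a)


# Illustrative coefficients only.
a, b, c = 1e-6, 1e-3, 0.5
T0 = critical_length(a, b, c)
print(f"T0 is roughly {T0:.1f} tokens")
print("D&C wins just above T0:", dc_loss(T0 + 1, b, c) < strong_loss(T0 + 1, a))
print("Strong model wins just below T0:", dc_loss(T0 - 1, b, c) > strong_loss(T0 - 1, a))
```

Any strong-model loss that grows faster than linearly would give the same qualitative picture; only the location of the crossover changes.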

Figures (4)

  • Figure 1: A simple implementation of the divide and conquer framework. The Math task example (right panel) illustrates the planner's critical role: it translates the instruction to return the second-smallest number into an instruction to return the two smallest numbers per chunk.
  • Figure 2: Joint effect of the task term (decomposability / cross-chunk dependency, $\mathcal{L}_{\mathrm{task}}$) and the model term (length-induced degradation, $\mathcal{L}_{\mathrm{model}}$). According to the discussion in Sec. \ref{sec:regimes}, (a) has negligible task/model terms; (b)-(e) are model-term dominated; (f) is task-term dominated.
  • Figure 3: Aggregator errors across different tasks and models.
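
The planner's role described in the Figure 1 caption can be sketched with a toy version of the Math task. This is a minimal illustration, not the paper's implementation: the function names (`split_into_chunks`, `planned_worker`, `literal_worker`, `aggregate`) are hypothetical, and deterministic Python functions stand in for what would be LLM calls. It shows why passing the original instruction to each chunk loses the answer, while the planner's rewritten per-chunk instruction preserves it.

```python
from typing import Callable, List


def split_into_chunks(numbers: List[int], chunk_size: int) -> List[List[int]]:
    """Divide the input into consecutive chunks of at most chunk_size items."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def literal_worker(chunk: List[int]) -> List[int]:
    """Applies the original instruction per chunk: return only the 2nd smallest."""
    return sorted(chunk)[1:2]


def planned_worker(chunk: List[int]) -> List[int]:
    """Applies the planner's rewritten instruction: return the two smallest."""
    return sorted(chunk)[:2]


def aggregate(partials: List[List[int]]) -> int:
    """Merge per-chunk candidates and pick the 2nd smallest overall."""
    merged = sorted(x for part in partials for x in part)
    return merged[1]


def run_workers(numbers: List[int], chunk_size: int,
                worker: Callable[[List[int]], List[int]]) -> List[List[int]]:
    """Run one worker per chunk and collect the partial results."""
    return [worker(chunk) for chunk in split_into_chunks(numbers, chunk_size)]


numbers = [1, 9, 8, 7, 2, 6, 5, 4]  # true 2nd smallest is 2

# Literal instruction: each chunk reports only its own 2nd smallest,
# so the global answer (2) never reaches the aggregator.
print(run_workers(numbers, 4, literal_worker))

# Planner-rewritten instruction: each chunk reports its two smallest,
# which is enough for the aggregator to recover the global answer.
planned_partials = run_workers(numbers, 4, planned_worker)
print(planned_partials, aggregate(planned_partials))
```

In the framework's terms, the literal variant fails because the sub-task discards information that another chunk's result depends on, whereas the planned variant keeps just enough per-chunk state for aggregation to succeed.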

Theorems & Definitions (1)

  • Proposition 3.1: The D&C Advantage