Retrieval Or Holistic Understanding? Dolce: Differentiate Our Long Context Evaluation Tasks

Zi Yang

TL;DR

This paper presents the Dolce framework, which parameterizes each problem by $\lambda$ (complexity) and $k$ (redundancy) and assigns it to one of five predefined focus categories. It also proposes sampling short contexts from the full context and estimating the probability that an LLM solves the problem using the sampled spans.

Abstract

We argue that there are two major distinct capabilities in long context understanding: retrieval and holistic understanding. Understanding and further improving LLMs' long context capabilities would not be possible without knowing the tasks' focus categories. We aim to automatically identify retrieval focused and holistic understanding focused problems from suites of benchmarks and quantitatively measure the difficulty within each focus. In this paper, we present the Dolce framework, which parameterizes each problem by $\lambda$ (complexity) and $k$ (redundancy) and assigns it to one of five predefined focus categories. We propose to sample short contexts from the full context and estimate the probability that an LLM solves the problem using the sampled spans. To find the $\lambda$ and $k$ for each problem, we further propose a mixture model of a non-parametric background noise component and a parametric/non-parametric hybrid oracle component, where we derive the probability functions parameterized by $\lambda$ and $k$ for both the correct-or-wrong (COW) scenario and the partial-point-in-grading (PIG) scenario. Across 44 existing long context evaluation tasks, our proposed methods identify 0% to 67% of the problems as retrieval focused and 0% to 90% of the problems as holistic understanding focused.
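The sampling step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how spans are drawn, so the code assumes contiguous windows of a fixed number of units taken at a uniformly random offset, and it replaces the LLM call with a hypothetical `solves` callable. The names `sample_span` and `estimate_solve_probability` are likewise illustrative only.

```python
import random
from typing import Callable, List, Sequence


def sample_span(units: Sequence[str], span_len: int, rng: random.Random) -> List[str]:
    """Return a contiguous window of `span_len` units at a uniformly random offset.

    Assumption: spans are contiguous and of fixed length; the paper may differ.
    """
    if not 1 <= span_len <= len(units):
        raise ValueError("span_len must be between 1 and len(units)")
    start = rng.randrange(len(units) - span_len + 1)
    return list(units[start:start + span_len])


def estimate_solve_probability(
    units: Sequence[str],
    span_len: int,
    solves: Callable[[List[str]], bool],
    n_samples: int = 200,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of P(problem is solved | a sampled span of `span_len` units).

    `solves` is a hypothetical stand-in for prompting an LLM with the span and
    grading its answer as correct or wrong.
    """
    rng = random.Random(seed)
    hits = sum(bool(solves(sample_span(units, span_len, rng))) for _ in range(n_samples))
    return hits / n_samples


if __name__ == "__main__":
    # Toy context: ten units, only one of which contains the needed fact.
    context = [f"filler {i}" for i in range(10)]
    context[4] = "the key fact"

    def toy_solver(span: List[str]) -> bool:
        # Stand-in for an LLM: "solves" the problem iff the key unit is visible.
        return "the key fact" in span

    for span_len in (1, 3, 10):
        p = estimate_solve_probability(context, span_len, toy_solver)
        print(f"span_len={span_len}: estimated solve probability {p:.2f}")
```

In this toy setup the estimate grows with the span length, which is the kind of observation curve a model parameterized by $\lambda$ and $k$ would then be fitted to; the fitting itself is not sketched here.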

Paper Structure (40 sections, 29 equations, 17 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Dolce: Distinguish Our Long Context Evaluation Tasks
Sampling & Observation
Correct-Or-Wrong (COW) Scenario
Partial-Point-In-Grading (PIG) Scenario
Maximum Likelihood Estimation of $\lambda$, $k$
Preprocessing & Setups
Results
Correct-Or-Wrong (COW) Scenario Results
Partial-Point-In-Grading (PIG) Scenario Results
Examples: QuALITY & LongFQA
Further Analysis & Discussions
Conclusion & Future Work
Details of $\pi$ in the COW Scenario ($k$-Repeated Length-$\lambda$ Sufficient Spans)
...and 25 more sections

Figures (17)

Figure 1: Problem parameterization by $\lambda$ (complexity) and $k$ (redundancy). Category mapping is illustrated on the left and formally determined by the table on the right. $L$ represents full context.
Figure 2: Task focus categories. Tasks are sorted by the total percentage of Categories III to V.
Figure 3: Probability that the oracle model correctly answers the problem (i.e. a "1" outcome) and cannot answer the problem (i.e. an "IDK" outcome) under the COW assumption.
Figure 4: Probability that the oracle model observes a proportion at 1, 0.5 and 0 respectively, under the PIG assumption.
Figure 5: Example length statistics for the L-Eval NQ task, showing length distributions across all problems: (1) number of tokens in each input, (2) number of units if the contexts are split by the "<P>" tag, (3) number of tokens in each unit if the contexts are split by the "<P>" tag, (4) number of units if the inputs are split into sentences as identified by NLTK, (5) number of tokens in each unit if the inputs are split into sentences as identified by NLTK, (6) number of tokens in each instruction, and (7) number of tokens in each ground-truth answer. The 50th and 99th percentiles are also annotated.
...and 12 more figures
