Retrieval Or Holistic Understanding? Dolce: Differentiate Our Long Context Evaluation Tasks

Zi Yang

TL;DR

This paper presents the Dolce framework, which parameterizes each problem by $\lambda$ (complexity) and $k$ (redundancy) and assigns it to one of five predefined focus categories. It proposes sampling short contexts from the full context and using the sampled spans to estimate the probability that an LLM solves the problem.

Abstract

We argue that there are two major distinct capabilities in long context understanding: retrieval and holistic understanding. Understanding and further improving LLMs' long context capabilities would not be possible without knowing the tasks' focus categories. We aim to automatically identify retrieval focused and holistic understanding focused problems from suites of benchmarks and to quantitatively measure the difficulty within each focus. In this paper, we present the Dolce framework, which parameterizes each problem by $\lambda$ (complexity) and $k$ (redundancy) and assigns it to one of five predefined focus categories. We propose to sample short contexts from the full context and to estimate the probability that an LLM solves the problem using the sampled spans. To find $\lambda$ and $k$ for each problem, we further propose a mixture model of a non-parametric background noise component and a parametric/non-parametric hybrid oracle component, for which we derive the probability functions parameterized by $\lambda$ and $k$ for both the correct-or-wrong (COW) scenario and the partial-point-in-grading (PIG) scenario. Across 44 existing long context evaluation tasks, our proposed methods identify 0% to 67% of the problems as retrieval focused and 0% to 90% of the problems as holistic understanding focused.
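The span-sampling step described in the abstract can be illustrated with a minimal sketch. It assumes that a context is a list of units, that spans are contiguous with uniformly random offsets, and that a boolean `solves` callable stands in for running and grading an LLM; the paper's actual sampling scheme and grading may differ. The names `sample_span`, `estimate_solve_probability`, and `toy_solver` are hypothetical.

```python
import random
from typing import Callable, List


def sample_span(units: List[str], span_len: int, rng: random.Random) -> List[str]:
    """Return a contiguous span of `span_len` units at a uniformly random offset."""
    if span_len >= len(units):
        return list(units)
    start = rng.randrange(len(units) - span_len + 1)
    return units[start:start + span_len]


def estimate_solve_probability(
    units: List[str],
    span_len: int,
    solves: Callable[[List[str]], bool],
    n_samples: int = 200,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the chance that a sampled span suffices.

    `solves` is a placeholder for "prompt the LLM with this span and grade
    the answer"; here it is any function returning True or False.
    """
    rng = random.Random(seed)
    hits = sum(bool(solves(sample_span(units, span_len, rng))) for _ in range(n_samples))
    return hits / n_samples


# Toy problem: exactly one unit carries the answer, so it behaves like retrieval.
units = [f"filler {i}" for i in range(20)]
units[7] = "the key is 42"


def toy_solver(span: List[str]) -> bool:
    # Stand-in for an LLM call: succeeds only if the answer unit is in the span.
    return any("42" in unit for unit in span)


p_short = estimate_solve_probability(units, 1, toy_solver)
p_full = estimate_solve_probability(units, len(units), toy_solver)
```

In this toy setup the estimate grows with the span length, because longer spans are more likely to cover the single answer-bearing unit. How such estimates are turned into $\lambda$ and $k$ is the job of the mixture model, which this sketch does not attempt.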

Paper Structure

This paper contains 40 sections, 29 equations, 17 figures, 6 tables, and 2 algorithms.

Figures (17)

  • Figure 1: Problem parameterization by $\lambda$ (complexity) and $k$ (redundancy). Category mapping is illustrated on the left and formally determined by the table on the right. $L$ represents full context.
  • Figure 2: Task focus categories. Tasks are sorted by the total percentage of Categories III to V.
  • Figure 3: Probability that the oracle model correctly answers the problem (i.e. a "1" outcome) and cannot answer the problem (i.e. an "IDK" outcome) under the COW assumption.
  • Figure 4: Probability that the oracle model observes a proportion at 1, 0.5 and 0 respectively, under the PIG assumption.
  • Figure 5: Example of length statistics for the L-Eval NQ task, showing the length distributions across all problems: (1) number of tokens for each input, (2) number of units if the contexts are split by the "<P>" tags, (3) number of tokens in each unit if the contexts are split by the "<P>" tags, (4) number of units if the inputs are split into sentences as identified by NLTK, (5) number of tokens in each unit if the inputs are split into sentences as identified by NLTK, (6) number of tokens in each instruction, and (7) number of tokens in each ground-truth answer. The 50th and 99th percentiles are also annotated.
  • ...and 12 more figures
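
The kind of length statistics listed in the Figure 5 caption can be sketched as follows. This is an illustration only: it assumes whitespace tokenization in place of a model tokenizer, covers only the "<P>"-delimited splits (not the NLTK sentence splits), and uses a nearest-rank percentile. The names `percentile` and `length_stats` are hypothetical.

```python
import math
from typing import Dict, List


def percentile(values: List[int], q: float) -> int:
    """Nearest-rank percentile of a non-empty list, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def length_stats(contexts: List[str], delimiter: str = "<P>") -> Dict[str, Dict[str, int]]:
    """Summarize token and unit counts with 50th and 99th percentiles."""
    # Tokens per input, ignoring the delimiter itself (whitespace tokens, an assumption).
    tokens_per_input = [len(c.replace(delimiter, " ").split()) for c in contexts]
    # Units per input: non-empty pieces between delimiters.
    units = [[u for u in c.split(delimiter) if u.strip()] for c in contexts]
    units_per_input = [len(us) for us in units]
    # Tokens per unit, pooled over all inputs.
    tokens_per_unit = [len(u.split()) for us in units for u in us]

    series = {
        "tokens_per_input": tokens_per_input,
        "units_per_input": units_per_input,
        "tokens_per_unit": tokens_per_unit,
    }
    return {
        name: {"p50": percentile(vals, 50), "p99": percentile(vals, 99)}
        for name, vals in series.items()
    }


stats = length_stats(["<P>a b c<P>d e", "<P>x y"])
```

With a real benchmark, the whitespace split would be swapped for the tokenizer of the model under evaluation, since token counts depend on the vocabulary.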