ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage

Taewhoo Lee, Chanwoong Yoon, Kyochul Jang, Donghyeon Lee, Minju Song, Hyunjae Kim, Jaewoo Kang

TL;DR

The paper introduces information coverage (IC), a metric that quantifies how much of the provided long context is actually needed to answer a query, and presents ETHIC, a benchmark of 1,986 high-IC instances across books, debates, medicine, and law. The study shows that contemporary LLMs struggle substantially on high-IC tasks, even with very long contexts, and analyzes factors such as context length, information position, and degeneration during generation. The results point to a clear gap between current long-context capabilities and the demands of full-context utilization. Together, the benchmark and the IC metric give a multi-domain foundation for developing models and evaluation methods that better leverage extended inputs.

Abstract

Recent advancements in large language models (LLMs) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 1,986 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.
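
To make the idea of IC concrete, the following is a minimal sketch that treats IC as the fraction of context chunks a query actually needs. The chunk-level granularity, the function name `information_coverage`, and the way "needed" chunks are supplied are all assumptions made for illustration; the paper's exact formulation is not reproduced here.

```python
from typing import Set


def information_coverage(num_chunks: int, needed_chunks: Set[int]) -> float:
    """Return the share of context chunks required to answer a query.

    Hypothetical helper: assumes the context has already been split into
    `num_chunks` chunks and that `needed_chunks` holds the 0-based indices
    of the chunks judged necessary for the answer.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    out_of_range = [i for i in needed_chunks if not 0 <= i < num_chunks]
    if out_of_range:
        raise ValueError(f"chunk indices out of range: {out_of_range}")
    return len(needed_chunks) / num_chunks


# A needle-in-a-haystack style query touches one chunk out of 100 (low IC).
low_ic = information_coverage(100, {42})

# A query that requires every section of the input touches all chunks (high IC).
high_ic = information_coverage(100, set(range(100)))
```

Under these assumptions, the first query has an IC of 1% and the second an IC of 100%, which mirrors the contrast the abstract draws between existing benchmarks and high-IC tasks.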

Paper Structure

This paper contains 48 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The variation in model performance with the level of information coverage (IC). Unlike low-IC tasks, which focus on specific parts of the input context, our benchmark features new high-IC tasks that demand the full utilization of all available information, posing a significant challenge for long-context models.
  • Figure 2: Overall description of ETHIC. Our benchmark includes four tasks: (a) the recalling task involves identifying specific types of entities in the text, (b) the summarizing task involves writing a summary for each section of the input, (c) the organizing task involves arranging mixed contents in the correct order, and (d) the attributing task focuses on identifying the underlying point of view within medical studies or legal documents.
  • Figure 3: The model's performance on low-IC and high-IC tasks. Low-IC tasks were created by generating new queries and answers using the same input context from our benchmark, which are represented by the gray bars on the left side of the graph (please refer to Section \ref{subsec:localglobal} for details). The yellow bars on the right represent high-IC tasks from our benchmark. The numbers (%) displayed in the bar graphs represent the IC values of the tasks. The y-axis indicates the model performance.
  • Figure 4: The performance with varied context lengths on low- and high-IC tasks. We used the single-document QA and recalling tasks from the books and debates domains for low- and high-IC tasks, respectively.
  • Figure 5: The effect of the position of information in the summarizing task. The x-axis represents the position of the chunks within the input context, while the y-axis represents the total length of the input context. Blue (red) chunks indicate summaries with high (low) scores.