ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage
Taewhoo Lee, Chanwoong Yoon, Kyochul Jang, Donghyeon Lee, Minju Song, Hyunjae Kim, Jaewoo Kang
TL;DR
This paper introduces Information Coverage (IC), a metric that quantifies how much of a provided long context is actually needed to answer a query, and presents ETHIC, a benchmark of 1,986 high-IC instances across books, debates, medicine, and law. Contemporary LLMs struggle substantially on high-IC tasks, even when they support very long contexts; the study analyzes factors such as context length, information position, and degeneration in generation. The results show a clear gap between current long-context capabilities and the demands of full-context utilization, and the benchmark and IC metric provide a foundation for developing models and evaluation methods that make better use of extended inputs.
Abstract
Recent advancements in large language models (LLMs) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 1,986 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.
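The abstract describes IC as the proportion of the input context necessary for answering a query. The following is a minimal sketch of that ratio, not the authors' implementation: it assumes the context has already been split into chunks with known token counts and that each chunk carries a hypothetical boolean label saying whether it is needed for the answer. The function name `information_coverage` and both parameters are illustrative assumptions.

```python
from typing import Sequence


def information_coverage(chunk_lengths: Sequence[int], needed: Sequence[bool]) -> float:
    """Return the fraction of context tokens that lie in chunks marked as needed.

    Assumption: chunk-level "needed" labels are given; how such labels are
    obtained is outside the scope of this sketch.
    """
    if len(chunk_lengths) != len(needed):
        raise ValueError("chunk_lengths and needed must have the same length")
    total = sum(chunk_lengths)
    if total == 0:
        raise ValueError("context must contain at least one token")
    # Sum the token counts of only those chunks flagged as necessary.
    used = sum(length for length, flag in zip(chunk_lengths, needed) if flag)
    return used / total


# Needle-in-a-haystack style input: 100 chunks of 500 tokens, one needed.
low_ic = information_coverage([500] * 100, [True] + [False] * 99)

# Full-context task: every chunk contributes to the answer.
high_ic = information_coverage([500] * 100, [True] * 100)
```

Under these assumptions, the needle-style input yields an IC near 0.01 while the full-context input yields 1.0, which illustrates the contrast the abstract draws between long inputs and the portion of them that is actually used.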
