LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams

Yongxuan Wu, Runyu Chen, Peiyu Liu, Hongjin Qian

TL;DR

LiveLongBench addresses the challenge of long-context understanding in spoken language by introducing a bilingual, live-stream-derived benchmark with average context lengths near $97K$ tokens. It systematically evaluates both foundation models and KV cache compression strategies across retrieval, reasoning, and hybrid tasks, revealing substantial gaps between current systems and human performance, especially in long, informal discourse. A hybrid compression approach, validated through a Data Envelopment Analysis framework, provides superior performance-memory trade-offs and highlights the potential for efficient real-world deployment in e-commerce settings. The work fills a crucial gap in evaluating long-context spoken language understanding and offers practical guidance for deploying robust, memory-efficient LLMs in live-stream applications.

Abstract

Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language models (LLMs) achieve impressive results on existing benchmarks, these datasets fail to reflect the complexities of such texts, limiting their applicability to practical scenarios. To bridge this gap, we construct the first spoken long-text dataset, derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-world scenarios. We construct tasks in three categories: retrieval-dependent, reasoning-dependent, and hybrid. We then evaluate both popular LLMs and specialized methods to assess their ability to understand long contexts in these tasks. Our results show that current methods exhibit strong task-specific preferences and perform poorly on highly redundant inputs, with no single method consistently outperforming others. We propose a new baseline that better handles redundancy in spoken text and achieves strong performance across tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding. Finally, our benchmark fills a gap in evaluating long-context spoken language understanding and provides a practical foundation for developing real-world e-commerce systems. The code and benchmark are available at https://github.com/Yarayx/livelongbench.

Paper Structure

This paper contains 43 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Distribution of Data Categories Across E-Commerce Domains
  • Figure 2: Showcase of Three Evaluation Tasks in LiveLongBench
  • Figure 3: Performance of Context Compression Methods on LLaMA-3.1-8B-Instruct. "K." denotes KIVI, "M." denotes MInference, and "L." denotes LLMLingua, while "2x" and "4x" refer to compression ratios. Methods shown in bold along the x-axis represent multi-methods. From left to right, the methods are arranged in descending order of their Overall average scores. For each bar, the darker segment represents the "Exact Match (%)" score of the corresponding method. Detailed results are provided in Table \ref{tab:method_main} in the Appendix.
  • Figure 4: Efficiency Scores Based on DEA Analysis
  • Figure 5: Wordcloud
  • ...and 5 more figures