SEGMENT+: Long Text Processing with Short-Context Language Models

Wei Shi; Shuang Li; Kerun Yu; Jinglei Chen; Zujie Liang; Xinhui Wu; Yuxi Qian; Feng Wei; Bo Zheng; Jiaqing Liang; Jiangjie Chen; Yanghua Xiao

TL;DR

Segment+ addresses the challenge of long-input processing for language models with restricted context windows by introducing a two-stage framework that gathers and merges information using structured notes and a filtering module. It defines notes with Evidence and Reasoning, enabling controllable information flow and interpretable reasoning, and applies batch merging to fit within the limited context window. Across long-document QA and Babilong-style tasks, Segment+ yields robust performance gains across model sizes, outperforming retrieval-augmented and agent-based baselines and demonstrating strong noise resistance and efficiency. The work highlights the importance of explicit information control, ablation-backed design choices, and segment size analysis, with implications for scalable long-text understanding and broader applications in memory management and multimedia contexts.

Abstract

There is a growing interest in expanding the input capacity of language models (LMs) across various domains. However, simply increasing the context window does not guarantee robust performance across diverse long-input processing tasks, such as understanding extensive documents and extracting detailed information from lengthy and noisy data. In response, we introduce SEGMENT+, a general framework that enables LMs to handle extended inputs within limited context windows efficiently. SEGMENT+ utilizes structured notes and a filtering module to manage information flow, resulting in a system that is both controllable and interpretable. Our extensive experiments across various model sizes, focusing on long-document question-answering and Needle-in-a-Haystack tasks, demonstrate the effectiveness of SEGMENT+ in improving performance.
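The gathering and filtering steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Note` fields mirror the Evidence and Reasoning components named in the TL;DR, but the label values (`"keep"`, `"remove"`), the whitespace-based segmentation, and the `take_note` callable (a stand-in for a language model call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Note:
    evidence: str   # text taken directly from the segment
    reasoning: str  # the model's own analysis, which may be misleading
    label: str      # "keep" or "remove" (label names are assumed here)


def split_into_segments(words: List[str], segment_size: int) -> List[List[str]]:
    """Split a list of words into consecutive segments of at most segment_size."""
    return [words[i:i + segment_size] for i in range(0, len(words), segment_size)]


def gather_and_filter(
    document: str,
    question: str,
    take_note: Callable[[str, str], Note],
    segment_size: int,
) -> List[Note]:
    """Take one note per segment, then keep only notes labeled 'keep'."""
    words = document.split()  # assumption: whitespace words stand in for tokens
    notes = [
        take_note(" ".join(segment), question)
        for segment in split_into_segments(words, segment_size)
    ]
    return [note for note in notes if note.label == "keep"]


def toy_take_note(segment: str, question: str) -> Note:
    """Hypothetical stand-in for an LM call: keep segments mentioning 'Paris'."""
    if "Paris" in segment:
        return Note(evidence=segment, reasoning="Mentions Paris.", label="keep")
    return Note(evidence="", reasoning="Nothing relevant.", label="remove")


kept = gather_and_filter(
    "a b c d Paris is here e f", "Where?", toy_take_note, segment_size=4
)
```

In this toy run the document splits into three segments and only the one containing "Paris" survives the filter, so downstream steps see a single note rather than the full input.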

Paper Structure (26 sections, 1 equation, 5 figures, 3 tables)

Introduction
Related Work
Retrieval-augmented Generation
Long Context LMs
Memory Management
Method
Problem Formulation
Segment+
Experiments
Long Document Question Answering
Benchmarks
Baselines
Main Results
Needle-in-a-Haystack Question Answering
Benchmark
...and 11 more sections

Figures (5)

Figure 1: This picture illustrates the use of short-context models to tackle long document question answering tasks in Segment+. The process begins by gathering relevant context from the document for a specific question. Only notes labeled 'keep' are used as the context to derive the final answer, avoiding noise.
Figure 2: The proposed framework for Segment+ consists of three main components. First, a gathering module collects structural information for a given query, distinguishing direct, accurate context (evidence) from the model’s potentially misleading analysis (reasoning). Next, a filter module removes noisy segments for dense information management. Finally, we merge this information in batches, taking into account the limited context window of the merging language model, to produce a context of suitable length for the final answer.
Figure 3: Babilong (Kuratov et al., 2024) test performance comparison. The x-axis represents the length of the input. The y-axis shows the Exact Match (EM) performance on the Babilong task. Results for GPT-4 are taken from Babilong, with each task consisting of 25 items, consistent with the Babilong setting. The average accuracy (Avg acc) for vanilla models and Segment+ (GPT-4) denotes the mean score of all colored cells. However, for Segment+ (ChatGPT) and Segment+ (Mistral-7B), we calculate two average scores: the initial score represents the average over valid contexts for comparison with vanilla models, while the subsequent score indicates the average over all cells. Green indicates higher performance, while red signifies lower performance. Segment+ enhances overall accuracy and maintains stable performance as context length increases.
Figure 4: Ablation study results. 'No Label' refers to the condition without information filtering, 'No Structure' refers to the absence of a structured prompt, and 'Normal' indicates that the model operates with neither filtering nor structured prompts. The results demonstrate that both design elements contribute to the final performance.
Figure 5: Segment Size Results. The average performance in long document question-answering tasks remains stable across different segment sizes, with optimal results achieved at a segment size of 3000.
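
The batch merging step in the Figure 2 caption can be sketched as follows. This is an illustrative reading, not the paper's algorithm: the greedy packing strategy, the word-count length measure, the `max_rounds` guard, and the `merge` callable (a stand-in for the merging language model) are all assumptions.

```python
from typing import Callable, List


def length(text: str) -> int:
    """Assumed length measure: whitespace-separated words, not model tokens."""
    return len(text.split())


def pack_batches(notes: List[str], budget: int) -> List[List[str]]:
    """Greedily group consecutive notes so each batch stays within the budget.

    A single note longer than the budget still gets its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for note in notes:
        n = length(note)
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(note)
        used += n
    if current:
        batches.append(current)
    return batches


def merge_in_batches(
    notes: List[str],
    budget: int,
    merge: Callable[[List[str]], str],
    max_rounds: int = 10,
) -> List[str]:
    """Merge batch by batch until all notes together fit within the budget."""
    for _ in range(max_rounds):
        if sum(length(n) for n in notes) <= budget:
            break
        notes = [merge(batch) for batch in pack_batches(notes, budget)]
    return notes


def toy_merge(batch: List[str]) -> str:
    """Hypothetical stand-in for an LM merge: keep the first two words per note."""
    return " ".join(" ".join(note.split()[:2]) for note in batch)


merged = merge_in_batches(["a b c d", "e f g h", "i j k l"], budget=8, merge=toy_merge)
```

With a budget of 8 words, the three 4-word notes are packed into two batches, each batch is condensed by the stand-in merger, and the loop stops once the combined result fits the budget. The `max_rounds` cap is there only so that a merger that fails to shrink its input cannot loop forever.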
