ScaleFormer: Span Representation Cumulation for Long-Context Transformer

Jiangshu Du, Wenpeng Yin, Philip Yu

TL;DR

Transformer self-attention scales as $O(N^2)$ in the sequence length $N$, which hinders long-context tasks. ScaleFormer wraps pre-trained encoder–decoder models with a chunking strategy and a parameter-free Span Representation Cumulation mechanism that injects directional, boundary-based context into each segment, yielding $O(N)$ complexity without architectural changes. It also introduces middle token sampling to enrich local content, and it feeds the decoder the fused boundary representations concatenated with the sampled interior tokens. The method achieves competitive or state-of-the-art results on SummScreen, GovReport, and BookSum with BART-base and T5-base backbones. This approach enables efficient long-form reasoning with minimal retraining and no external retrieval, offering practical impact for long-document summarization and related tasks.
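
The linear-time claim can be checked with a short calculation. This is a sketch under assumptions not stated in the source: every chunk has a fixed length $C$, consecutive chunks are offset by a stride $S$, and the chunk overlap is written $O_v$ to avoid a clash with big-O notation. The symbols $C$, $S$, $K$, and $O_v$ are introduced here for illustration only.

$$
K = \left\lceil \frac{N - C}{S} \right\rceil + 1, \qquad S = C - O_v, \qquad N > C
$$

$$
\text{encoder attention cost} = K \cdot O(C^2) = O\!\left(\frac{N\,C^2}{C - O_v}\right) = O(N) \quad \text{for fixed } C \text{ and } O_v
$$

If each chunk passes on one Left vector, one Right vector, and $m$ sampled interior tokens (again an assumption about the exact counts), the decoder receives $K(2 + m)$ vectors, which also grows linearly in $N$.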

Abstract

The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer), a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk's representation with structural awareness of its position within the document. It achieves this by enriching each chunk's boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document's narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
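
The segmentation step described in the abstract can be illustrated with a minimal sketch. The function name `chunk_with_overlap` and its parameters are hypothetical, and a fixed chunk size with a fixed overlap is an assumption; the source does not say how the final, shorter chunk is handled.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split `tokens` into windows of `chunk_size` whose neighbours share `overlap` tokens.

    Illustrative only: the last window may be shorter than `chunk_size`.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not tokens:
        return []

    stride = chunk_size - overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return chunks


if __name__ == "__main__":
    # Ten token ids, windows of four tokens, one shared token between neighbours.
    for chunk in chunk_with_overlap(list(range(10)), chunk_size=4, overlap=1):
        print(chunk)
```

Because the stride is constant, the number of chunks grows linearly with the input length, and each chunk can then be encoded independently by the unmodified pre-trained encoder.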

Paper Structure

This paper contains 16 sections, 6 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the ScaleFormer framework. A long input document is segmented into overlapping chunks, each encoded independently. Boundary token representations (Left and Right) are extracted and fused with directional context. The Left boundary of a chunk is fused with context from prior chunks, and the Right boundary is fused with context from subsequent chunks, providing structural awareness.
  • Figure 2: Ablation study results on the SummScreen dev set. We analyze the impact of (a) the directional fusion ratio $\alpha$, (b) the number of middle tokens sampled ($m$), and (c) the chunk overlap size ($O$). Performance is measured in ROUGE-L.
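
The directional fusion and middle token sampling named in the two captions can be sketched as follows. This is not the authors' formulation, since the paper's equations are not reproduced here. It assumes one Left and one Right vector per chunk, a chunk summary equal to the mean of those two vectors, a cumulative context equal to the mean of the summaries on one side, a convex mix weighted by $\alpha$, and evenly spaced interior sampling. All function names are hypothetical.

```python
def _add(u, v):
    return [a + b for a, b in zip(u, v)]


def _scale(u, s):
    return [a * s for a in u]


def _mix(boundary, context, alpha):
    # Assumed fusion rule: (1 - alpha) * boundary + alpha * context.
    return _add(_scale(boundary, 1.0 - alpha), _scale(context, alpha))


def fuse_boundaries(lefts, rights, alpha):
    """Fuse each Left vector with prior context and each Right vector with later context.

    `lefts[i]` and `rights[i]` are the boundary vectors of chunk i (lists of floats).
    Running sums keep the whole pass linear in the number of chunks.
    """
    k = len(lefts)
    if k == 0:
        return [], []
    dim = len(lefts[0])

    # Assumed per-chunk summary: mean of the two boundary vectors.
    summaries = [_scale(_add(l, r), 0.5) for l, r in zip(lefts, rights)]

    # prefix[i] holds the sum of the summaries of chunks 0 .. i-1.
    prefix = [[0.0] * dim]
    for s in summaries:
        prefix.append(_add(prefix[-1], s))
    total = prefix[-1]

    fused_left, fused_right = [], []
    for i in range(k):
        n_before, n_after = i, k - 1 - i
        if n_before:
            context = _scale(prefix[i], 1.0 / n_before)
            fused_left.append(_mix(lefts[i], context, alpha))
        else:
            fused_left.append(list(lefts[i]))  # first chunk has no prior context
        if n_after:
            after_sum = _add(total, _scale(prefix[i + 1], -1.0))
            context = _scale(after_sum, 1.0 / n_after)
            fused_right.append(_mix(rights[i], context, alpha))
        else:
            fused_right.append(list(rights[i]))  # last chunk has no later context
    return fused_left, fused_right


def sample_middle_indices(chunk_len, m):
    """Pick up to `m` evenly spaced interior positions, excluding both boundary tokens."""
    interior = max(chunk_len - 2, 0)
    m = min(m, interior)
    return [1 + (j * interior) // (m + 1) for j in range(1, m + 1)]


if __name__ == "__main__":
    lefts = [[0.0], [2.0], [4.0]]
    rights = [[2.0], [4.0], [6.0]]
    print(fuse_boundaries(lefts, rights, alpha=0.5))
    print(sample_middle_indices(chunk_len=10, m=3))
```

Under these assumptions, the decoder input for a chunk would be its fused Left vector, the encoder states at the sampled interior positions, and its fused Right vector, concatenated across chunks in document order.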