BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization

Desta Haileselassie Hagos, Legand L. Burge, Anietie Andy, Anis Yazidi, Vladimir Vlassov

TL;DR

BiSparse-AAS introduces a unified, efficient framework for long-sequence text summarization by integrating bilinear attention with sparse attention and adaptive attention spans. By replacing standard self-attention with a learnable bilinear form and dynamically masking and extending attention ranges, the approach achieves near-linear scalability while preserving contextual coherence. Empirical results across CNN/DailyMail, XSum, OpenWebText, and Gigaword report strong ROUGE and semantic metrics, with notable parameter and compute reductions compared to GPT-2-like baselines. The framework serves as a drop-in, resource-friendly solution for extractive and abstractive summarization and offers a scalable path for broader long-sequence NLP tasks.

Abstract

Transformer-based architectures have advanced text summarization, yet their quadratic complexity limits scalability on long documents. This paper introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a novel framework that combines sparse attention, adaptive spans, and bilinear attention to address these limitations. Sparse attention reduces computational costs by focusing on the most relevant parts of the input, while adaptive spans dynamically adjust the attention ranges. Bilinear attention complements both by modeling complex token interactions within this refined context. BiSparse-AAS consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization tasks, achieving average ROUGE improvements of about 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets. By addressing efficiency, scalability, and long-sequence modeling, BiSparse-AAS provides a unified, practical solution for real-world text summarization applications.
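
The way the three mechanisms in the abstract might compose can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every name in it (`bilinear_scores`, `topk_mask`, `soft_span_mask`, `bisparse_attention`, and the parameters `W`, `k`, `span`, `ramp`) is a hypothetical chosen for illustration. It assumes that bilinear scores take the form x_i^T W x_j, that sparsity is a per-query top-k selection, and that the span is a soft distance-based mask. The span is a fixed scalar here, whereas an adaptive span would be learned. The sketch builds the full n x n score matrix for clarity, so it shows only the masking logic and not the near-linear cost reported in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def bilinear_scores(X, W):
    # Assumed bilinear form: s_ij = x_i^T W x_j, with W a learnable matrix.
    return X @ W @ X.T

def soft_span_mask(n, span, ramp):
    # Soft mask over token distance d = |i - j|:
    # 1 inside the span, linear decay over `ramp`, 0 beyond span + ramp.
    pos = np.arange(n)
    d = np.abs(pos[:, None] - pos[None, :])
    return np.clip((ramp + span - d) / ramp, 0.0, 1.0)

def topk_mask(scores, k):
    # Boolean mask keeping the k largest scores in each row.
    n = scores.shape[-1]
    k = max(1, min(k, n))
    idx = np.argpartition(scores, n - k, axis=-1)[:, n - k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def bisparse_attention(X, W, V, k, span, ramp=2.0):
    n, d_model = X.shape
    scores = bilinear_scores(X, W) / np.sqrt(d_model)
    span_mask = soft_span_mask(n, span, ramp)
    # Drop positions outside the span first, then keep the top-k survivors.
    scores = np.where(span_mask > 0, scores, -np.inf)
    scores = np.where(topk_mask(scores, k), scores, -np.inf)
    # Down-weight positions on the ramp and renormalize each row.
    weights = softmax(scores) * span_mask
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Calling `bisparse_attention(X, W, V, k=2, span=1, ramp=1.0)` on a 6-token input gives each query at most two nonzero attention weights, all within two positions of the query. The span mask is applied before the top-k step so that every row keeps at least its own position and the renormalization never divides by zero.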

Paper Structure

This paper contains 22 sections, 10 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Model architecture.
  • Figure 2: Validation ROUGE scores across 500K training steps for sparse attention and adaptive spans. The curves illustrate the training dynamics and stability of each mechanism, complementing the dataset-specific results discussed in the Results and Discussion section. Due to space limitations, we report R-1, R-2, and R-L scores for sparse attention and the R-1 score for adaptive spans.
  • Figure :