SKT5SciSumm -- Revisiting Extractive-Generative Approach for Multi-Document Scientific Summarization

Huy Quoc To, Ming Liu, Guangyan Huang, Hung-Nghiep Tran, André Greiner-Petter, Felix Beierle, Akiko Aizawa

TL;DR

This paper proposes SKT5SciSumm, a hybrid framework for multi-document scientific summarization (MDSS). It uses the Sentence-Transformer version of Scientific Paper Embeddings using Citation-Informed Transformers (SPECTER) to encode and represent sentences, which allows efficient extractive summarization with k-means clustering.

Abstract

Summarization for scientific text has shown significant benefits both for the research community and human society. Given the fact that the nature of scientific text is distinctive and the input of the multi-document summarization task is substantially long, the task requires sufficient embedding generation and text truncation without losing important information. To tackle these issues, in this paper, we propose SKT5SciSumm - a hybrid framework for multi-document scientific summarization (MDSS). We leverage the Sentence-Transformer version of Scientific Paper Embeddings using Citation-Informed Transformers (SPECTER) to encode and represent textual sentences, allowing for efficient extractive summarization using k-means clustering. We employ the T5 family of models to generate abstractive summaries using extracted sentences. SKT5SciSumm achieves state-of-the-art performance on the Multi-XScience dataset. Through extensive experiments and evaluation, we showcase the benefits of our model by using less complicated models to achieve remarkable results, thereby highlighting its potential in advancing the field of multi-document summarization for scientific text.
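The extractive stage described in the abstract (embed sentences, cluster them with k-means, keep a subset) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the sentence embeddings are already computed (the paper obtains them from SPECTER; toy 2-D vectors stand in here), and it assumes the selection rule "keep the sentence closest to each centroid", which the abstract does not specify. The names `kmeans` and `extract_sentences` are hypothetical, and the T5 abstractive stage is omitted.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means with deterministic, evenly spaced initial centroids."""
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


def extract_sentences(sentences, embeddings, k):
    """Assumed rule: keep the sentence nearest each centroid, in source order."""
    k = min(k, len(sentences))
    if k <= 0:
        return []
    centroids, assignments = kmeans(embeddings, k)
    chosen = set()
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            chosen.add(min(members, key=lambda i: math.dist(embeddings[i], centroid)))
    return [sentences[i] for i in sorted(chosen)]


# Toy input: two well-separated groups of made-up 2-D "embeddings".
sentences = ["A", "B", "C", "D", "E", "F"]
embeddings = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.2, 5.0], [5.1, 5.1]]
extracted = extract_sentences(sentences, embeddings, k=2)
```

In a full pipeline, the extracted sentences would be concatenated and passed to a T5 model as the shortened input for abstractive generation. The number of clusters then controls how much text reaches the generator.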

Paper Structure

This paper contains 21 sections, 1 equation, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Our hybrid approach for multi-document scientific summarization.
  • Figure 2: Distribution of tokens from raw input text compared with extracted summaries in train and validation set using T5 tokenizer.
  • Figure 3: The voting results of two human evaluators on the generated results of SKT5SciSumm and GPT-4 compared to references.
  • Figure 4: The distribution of average relevance and readability scores for summaries generated by SKT5SciSumm.
  • Figure 5: The distribution of average relevance and readability scores for summaries generated by GPT-4.