CS-PaperSum: A Large-Scale Dataset of AI-Generated Summaries for Scientific Papers

Javin Liu, Aryan Vats, Zihao He

TL;DR

CS-PaperSum tackles the challenge of scalable scientific literature analysis in computer science by providing structured AI-generated summaries for nearly 92k papers across 31 conferences. The approach uses GPT-3.5 to extract key contributions, methodologies, evaluation metrics, and future directions, enabling standardized cross-paper comparisons. The authors validate the quality with SciBERT-based embeddings and KeyBERT keyword overlap, showing faithful semantic preservation and topic retention. A case study demonstrates how the dataset supports trend detection across major AI conferences, highlighting shifts toward self-supervised learning, retrieval-augmented generation, and multimodal AI, with broader implications for automated literature reviews and AI-driven discovery.

Abstract

The rapid expansion of scientific literature in computer science presents challenges in tracking research trends and extracting key insights. Existing datasets provide metadata but lack structured summaries that capture core contributions and methodologies. We introduce CS-PaperSum, a large-scale dataset of 91,919 papers from 31 top-tier computer science conferences, enriched with AI-generated structured summaries using ChatGPT. To assess summary quality, we conduct embedding alignment analysis and keyword overlap analysis, demonstrating strong preservation of key concepts. We further present a case study on AI research trends, highlighting shifts in methodologies and interdisciplinary crossovers, including the rise of self-supervised learning, retrieval-augmented generation, and multimodal AI. Our dataset enables automated literature analysis, research trend forecasting, and AI-driven scientific discovery, providing a valuable resource for researchers, policymakers, and scientific information retrieval systems.
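The embedding alignment analysis mentioned above can be illustrated with a small sketch. The scoring procedure is not spelled out in this section, so the example assumes a simple one: compute the cosine similarity between each paper's embedding and its summary's embedding, then average over papers. The function names and the toy 3-dimensional vectors are hypothetical; in practice the vectors would come from an encoder such as SciBERT.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def mean_alignment(original_vecs, summary_vecs):
    """Average cosine similarity over paired (original, summary) embeddings."""
    if not original_vecs or len(original_vecs) != len(summary_vecs):
        raise ValueError("need two non-empty lists of equal length")
    scores = [cosine_similarity(o, s) for o, s in zip(original_vecs, summary_vecs)]
    return sum(scores) / len(scores)


# Toy vectors standing in for real embeddings (assumption, not paper data).
original = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
summary = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
alignment = mean_alignment(original, summary)
```

In this toy input the first pair points in the same direction and the second pair is orthogonal, so the average lands halfway between perfect and no alignment.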

Paper Structure

This paper contains 19 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Fraction (%) of papers published in each conference. Only conferences that contribute at least 2.3% of the total publications are shown.
  • Figure 2: Fraction (%) of papers published by the most productive affiliations. We focus on the leading universities, industry research labs, and institutions contributing to computer science research.
  • Figure 3: t-SNE visualization of paper embeddings based on (a) the original paper content, including the title, abstract, and conclusion, and (b) the ChatGPT-generated summaries ("Key Takeaways"). Each point represents a paper, and different colors indicate different conferences. The spatial clustering patterns suggest that the AI-generated summaries effectively preserve the semantic structure of the original papers while maintaining distinctions between research domains.
  • Figure 4: Keyword overlap between original papers and their ChatGPT-generated summaries (“Key Takeaways”) across different conferences. The overlap is measured using KeyBERT-based keyword extraction, where higher values indicate greater retention of key concepts in the summaries. The results demonstrate that the AI-generated summaries effectively capture the main topics of the original papers while varying slightly across different research domains.
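
The keyword overlap reported in Figure 4 can be sketched as a set comparison. The exact formula is not given in this section, so the example assumes overlap means the fraction of the original paper's keywords that also appear among the summary's keywords. The function name and the keyword lists are hypothetical; in the paper's setup the lists would be produced by KeyBERT rather than written by hand.

```python
def keyword_overlap(original_keywords, summary_keywords):
    """Fraction of the original's keywords that also occur in the summary's keywords.

    Matching is case-insensitive and ignores surrounding whitespace.
    Returns 0.0 when the original has no keywords.
    """
    orig = {k.strip().lower() for k in original_keywords if k.strip()}
    summ = {k.strip().lower() for k in summary_keywords if k.strip()}
    if not orig:
        return 0.0
    return len(orig & summ) / len(orig)


# Hand-written keyword lists standing in for extractor output (assumption).
paper_kw = ["Self-supervised learning", "contrastive loss", "image retrieval", "transformer"]
summary_kw = ["self-supervised learning", "transformer", "benchmark"]
score = keyword_overlap(paper_kw, summary_kw)
```

Normalizing by the original's keyword count makes the score read as retention: a value of 1.0 means every extracted topic of the paper survives in the summary, regardless of any extra keywords the summary introduces.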