Talk2Ref: A Dataset for Reference Prediction from Scientific Talks

Frederik Broy, Maike Züfle, Jan Niehues

TL;DR

Talk2Ref introduces the Reference Prediction from Talks (RPT) task and provides the first large-scale dataset linking scientific talks to the papers cited in their source publications. The authors propose a dual-encoder approach inspired by dense passage retrieval and demonstrate substantial gains from domain adaptation and from talk-specific aggregation strategies for handling long transcripts. Fine-tuning on Talk2Ref improves citation prediction over zero-shot baselines, confirming the dataset's value for learning semantic representations across spoken and written scholarly content. The work highlights both the practicality and the challenges of grounding spoken scientific talks in relevant literature, and releases the dataset and models under an open license to accelerate future research in spoken-content citation systems.

Abstract

Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
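
To make the retrieval setup concrete, the following is a minimal sketch of how a long transcript could be split into chunks, embedded, aggregated, and scored against candidate papers. It is an illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a trained encoder, mean pooling is only one possible aggregation strategy, and the chunk size, function names, and example data are all invented for this sketch.

```python
import hashlib
import math

DIM = 64  # toy embedding size (assumption, not from the paper)


def normalise(vec):
    """Scale a vector to unit L2 norm; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def toy_embed(text):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalised.

    A real system would call a trained text embedding model here.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return normalise(vec)


def chunk_words(text, size=50):
    """Split a transcript into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed_talk(transcript, size=50):
    """Embed each chunk, then mean-pool the chunk vectors (assumed strategy)."""
    chunks = chunk_words(transcript, size) or [""]
    embs = [toy_embed(c) for c in chunks]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return normalise(mean)


def rank_papers(transcript, papers, k=3):
    """Return the ids of the k papers most similar to the talk (dot product)."""
    query = embed_talk(transcript)
    scored = [
        (sum(a * b for a, b in zip(query, toy_embed(text))), pid)
        for pid, text in papers.items()
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


if __name__ == "__main__":
    # Invented example data, for illustration only.
    papers = {
        "p1": "dense passage retrieval for open domain question answering",
        "p2": "image segmentation with convolutional networks",
        "p3": "speech translation of scientific talks",
    }
    talk = "in this talk we present retrieval of papers for scientific talks"
    print(rank_papers(talk, papers, k=2))
```

In a dual-encoder setting, the talk side and the paper side would each use a learned encoder in place of `toy_embed`, and the dot-product ranking step would stay the same.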

Paper Structure

This paper contains 40 sections, 1 equation, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Illustration of the Talk2Ref dataset and its use in the task of Reference Prediction from Scientific Talks (RPT), where query talks are paired with their cited papers. The "fast down button" icon marks the information included in Talk2Ref, and the "robot" icon marks the input used by our model for predicting cited papers.
  • Figure 2: Temporal distribution of cited works and abstracts in the dataset. The majority of references are concentrated between 2015 and 2022, with a marked increase from 2018 onward, reflecting the surge of research in natural language processing. This distribution ensures alignment with current research trends but underrepresents older foundational work.
  • Figure 3: Top 10 most frequently cited papers in the dataset.