cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers

Anirudh Sundar, Jin Xu, William Gay, Christopher Richardson, Larry Heck

TL;DR

This work addresses situated and multimodal conversational QA over scientific papers by introducing cPAPERS, a dataset of 5030 QA pairs about equations, figures, and tables, drawn from OpenReview reviews and grounded in arXiv LaTeX sources. It describes a scalable seven-step data collection pipeline that includes QA extraction with an LLM, crowdworker validation, and contextual grounding from LaTeX sources. Baseline experiments with zero-shot prompting and QLoRA fine-tuning show that context-aware grounding and modality-specific strategies can improve QA quality, and they expose both the benefits and the limitations of weakly grounded multimodal context. The dataset and baselines support the development of AI assistants capable of deep, document-grounded scientific inquiry, while also underscoring the challenge of version mismatches in scientific papers.

Abstract

An emerging area of research in situated and multimodal interactive conversations (SIMMC) includes interactions in scientific papers. Since scientific papers are primarily composed of text, equations, figures, and tables, SIMMC methods must be developed specifically for each component to support the depth of inquiry and interactions required by research scientists. This work introduces Conversational Papers (cPAPERS), a dataset of conversational question-answer pairs from reviews of academic papers grounded in these paper components and their associated references from scientific documents available on arXiv. We present a data collection strategy to collect these question-answer pairs from OpenReview and associate them with contextual information from LaTeX source files. Additionally, we present a series of baseline approaches utilizing Large Language Models (LLMs) in both zero-shot and fine-tuned configurations to address the cPAPERS dataset.

Paper Structure (19 sections, 1 figure, 8 tables)