ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering

Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Mingchen Zhuge, Jürgen Schmidhuber, Xin Gao, Xiangliang Zhang

TL;DR

This work addresses the need for domain-specific question answering in chemistry by introducing ScholarChemQA, a large-scale dataset of QA pairs derived from scholarly chemical literature, including substantial unlabeled data and imbalanced class distributions. To tackle these challenges, the authors propose QAMatch, a semi-supervised QA model that incorporates label rebalance, pseudo-label calibration, and SoftMix latent-space augmentation to leverage unlabeled data and address minority classes. Empirical results show that QAMatch outperforms strong baselines and large language models on ScholarChemQA and four benchmark datasets, demonstrating strong domain adaptation and robustness to imbalance. The dataset and method provide valuable resources for advancing chemical QA, enabling more accurate extraction of research insights from chemistry literature and related materials.

Abstract

Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth. While QA datasets are plentiful in areas such as the general domain and biomedicine, academic chemistry is less explored. Chemical QA plays a crucial role in both education and research by translating complex chemical information into a readily understandable format. Addressing this gap, we introduce ScholarChemQA, a large-scale QA dataset constructed from chemical papers. This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of potentially useful unlabeled data. Correspondingly, we introduce the QAMatch model, specifically designed to answer chemical questions by fully leveraging our collected data. We first address the issue of imbalanced label distribution by re-weighting the instance-wise loss based on the inverse frequency of each class, ensuring minority classes are not dominated by majority ones during optimization. Next, we utilize the unlabeled data to enrich the learning process, generating a variety of augmentations based on a SoftMix operation and ensuring their predictions align with the same target, i.e., pseudo-labels. To ensure the quality of the pseudo-labels, we propose a calibration procedure aimed at closely aligning the pseudo-label estimates of individual samples with a desired ground truth distribution. Experiments show that our QAMatch significantly outperforms recent similar-scale baselines and Large Language Models (LLMs) not only on our ScholarChemQA dataset but also on four benchmark datasets. We hope our benchmark and model can facilitate and promote more research on chemical QA.
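
The inverse-frequency re-weighting mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the normalisation (weights scaled so that they average to 1 over the training instances), and the example labels "yes", "no", and "maybe" are all assumptions made for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to a weight proportional to 1 / (class count).

    Assumed normalisation: n / (k * count), so the weights average
    to 1 across the instances in `labels`.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Mean instance-wise cross-entropy, scaled by each gold class weight.

    probs  : list of dicts mapping class -> predicted probability
    labels : list of gold classes, same length as probs
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)


# Hypothetical imbalanced label set: 6 "yes", 3 "no", 1 "maybe".
example_labels = ["yes"] * 6 + ["no"] * 3 + ["maybe"]
example_weights = inverse_frequency_weights(example_labels)
```

Under this scheme a misclassified minority-class instance contributes more to the loss than a majority-class one, which is the effect the abstract describes as keeping minority classes from being dominated during optimization.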

Paper Structure (26 sections, 8 equations, 10 figures, 6 tables)

This paper contains 26 sections, 8 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Comparison of general domain QA dataset BoolQ, chemical domain dataset KGQA, and our ScholarChemQA dataset. Our dataset sources from real-world research questions, in contrast to previous chemical datasets that were artificially constructed. Our dataset contains text rich in domain-specific information, making it highly suitable for evaluation.
  • Figure 2: (a) Illustration of data crawling process. (b) Topic distribution of ScholarChemQA. (c) Proportional relationships between corresponding question types and reasoning types. Different question types correspond to different reasoning types, showcasing the diversity of our dataset. 71.5% of the questions require chemical knowledge for answering, showing the difficulty of our chemical question-answering tasks.
  • Figure 3: QAMatch is trained using both labeled and unlabeled data. In the supervised training phase, label rebalancing is applied to adjust the loss regarding class infrequency. In the unsupervised phase, pseudo-labels are generated through pseudo-label calibration. The learning from unlabeled data is through the enforcement of consistency between the pseudo-labels and the predictions of instances augmented using SoftMix.
  • Figure 4: Error analysis. Supporting fact for the answer is highlighted.
  • Figure 5: The accuracy (%) and F1 scores (%) of our model and LLMs on the ScholarChemQA dataset.
  • ...and 5 more figures