BioTABQA: Instruction Learning for Biomedical Table Question Answering

Man Luo, Sharad Saxena, Swaroop Mishra, Mihir Parmar, Chitta Baral

TL;DR

BioTABQA is presented as the first biomedical table question answering benchmark. It is built from a differential-diagnosis textbook and designed to test generalization to unseen questions via three data splits. The authors develop an instruction-tuned multitask framework and compare it against single-task and multitask baselines, using table linearization and prompt-based inputs with DistilBERT. They find that multitask learning improves performance and that instruction tuning yields additional gains, especially in cross-task scenarios, which indicates better generalization. The work highlights the potential of instruction-based strategies for biomedical table QA and suggests directions for more naturalistic data and broader biomedical contexts.

Abstract

Table Question Answering (TQA) is an important but under-explored task. Most existing QA datasets are in unstructured text format, and only a few of them use tables as the context. To the best of our knowledge, no TQA dataset exists in the biomedical domain, where tables are frequently used to present information. In this paper, we first curate a table question answering dataset, BioTABQA, using 22 templates and context from a biomedical textbook on differential diagnosis. BioTABQA can be used not only to teach a model how to answer questions from tables but also to evaluate how a model generalizes to unseen questions, an important scenario for biomedical applications. To enable this generalization evaluation, we divide the templates into 17 for training and 5 for cross-task evaluation. Then, we develop two baselines using single-task and multi-task learning on BioTABQA. Furthermore, we explore instructional learning, a recent technique that has shown impressive generalization performance. Experimental results show that our instruction-tuned model outperforms the single-task and multi-task baselines by ~23% and ~6% on average, respectively, across various evaluation settings, and, more importantly, that the instruction-tuned model outperforms the baselines by ~5% on cross-tasks.
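
To make the idea of table linearization with prompt-based inputs more concrete, here is a minimal Python sketch of how a table and a question might be flattened into a single text sequence with an instruction prefix. The function names (`linearize_table`, `build_input`), the `row i: column: value` layout, the separators, and the placeholder table contents are all assumptions made for illustration; the paper's actual input format may differ.

```python
from typing import List


def linearize_table(header: List[str], rows: List[List[str]]) -> str:
    """Flatten a table row by row into 'column: value' pairs (assumed layout)."""
    lines = []
    for i, row in enumerate(rows, start=1):
        # Pair every cell with its column name so the model sees the schema.
        cells = " | ".join(f"{col}: {val}" for col, val in zip(header, row))
        lines.append(f"row {i}: {cells}")
    return " ; ".join(lines)


def build_input(instruction: str, question: str,
                header: List[str], rows: List[List[str]]) -> str:
    """Prepend a task instruction to the question and the linearized table."""
    table_text = linearize_table(header, rows)
    return f"instruction: {instruction} question: {question} table: {table_text}"


# Hypothetical placeholder table, not taken from the dataset.
header = ["Condition", "Key finding"]
rows = [["condition A", "finding X"], ["condition B", "finding Y"]]

print(build_input(
    "Answer the question using only the table.",
    "Which condition is associated with finding Y?",
    header,
    rows,
))
```

A single-task or multi-task baseline in this sketch would simply omit the instruction prefix, which is one plausible way to isolate the effect of the instruction.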

Paper Structure (20 sections, 2 figures, 9 tables)

Figures (2)

  • Figure 1: The average performance of three models on the in-domain testing sets of different Splits.
  • Figure 2: The average performance of three models on the cross-tasks of different Splits.