Clinical Reading Comprehension with Encoder-Decoder Models Enhanced by Direct Preference Optimization

Md Sultan Al Nahian, Ramakanth Kavuluru

TL;DR

The paper addresses extracting answers from clinical radiology notes by applying encoder-decoder transformers enhanced with Direct Preference Optimization (DPO) to the RadQA dataset. It demonstrates that encoder-decoder models outperform prior BERT-based baselines by over 10 F1 points, and that further DPO-based fine-tuning yields an additional gain of 1–3 F1 points, for a total of 12–15 points over the previous state of the art. A key contribution is the automatic generation of high-quality preference data (without human input) using model-based and rule-based strategies, together with an analysis of how factors such as model size and negative-data diversity influence the improvements. The work highlights DPO as an effective and computationally efficient alternative to RLHF for information-extraction tasks, with practical implications for improving radiology reading comprehension systems and potentially extending to other clinical NLP tasks.

Abstract

Extractive question answering over clinical text is a crucial need to help deal with the deluge of clinical text generated in hospitals. While encoder models (e.g., BERT) have been popular for this reading comprehension task, encoder-decoder models (e.g., T5) have recently been on the rise. Preference optimization techniques have also emerged to align decoder-only LLMs with human preferences. In this paper, we combine encoder-decoder models with the direct preference optimization (DPO) method to improve over the prior state of the art for the RadQA radiology question answering task by 12-15 F1 points. To the best of our knowledge, this effort is the first to show that the DPO method also works for reading comprehension, via novel heuristics that generate preference data without human input.
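
To make the DPO step concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It assumes the sequence log-probabilities of the preferred (chosen) and dispreferred (rejected) answers are already available from the policy being fine-tuned and from a frozen reference model. The function name, argument names, and the default beta = 0.1 are illustrative assumptions, not values taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are sequence-level log-probabilities. `beta` is a
    hypothetical default chosen for illustration only.
    """
    # How much more (in log space) the policy likes each answer than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# When the policy favors the chosen answer more than the reference does, the loss drops.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Because both models start from the same fine-tuned checkpoint, the loss at the first step equals log(2) for every pair; training lowers it by widening the gap between chosen and rejected answers relative to the reference.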

Paper Structure (28 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Pipeline of fine-tuning the language model using DPO. $\pi_{\theta}$ is the language model we want to fine-tune, and $\pi_{ref}$ is the reference model, which is kept frozen during the fine-tuning process. Both models are initialized with the SFT model.
  • Figure 2: Examples of negative (rejected) outputs created by rules.
  • Figure 3: Performance comparison of the DPO-T5-3b model with varying numbers of training examples and preference datasets generated using different thresholds. The x-axis plots the number of training examples, the y-axis plots the F1 score, and the line colors represent the preference datasets created by applying three different F1 thresholds.
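
The idea of building preference datasets with an F1 threshold can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact procedure: it assumes a SQuAD-style token-overlap F1, and it assumes that a model-generated candidate counts as a rejected answer when its F1 against the gold answer falls below the threshold. The function names, the default threshold of 0.5, and the example strings are all hypothetical.

```python
from collections import Counter


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def build_preference_pairs(question, gold, candidates, f1_threshold=0.5):
    """Pair the gold answer (chosen) with each low-F1 candidate (rejected).

    `f1_threshold` is a hypothetical default; the thresholds used in the
    paper are not stated in this section.
    """
    pairs = []
    for candidate in candidates:
        if token_f1(candidate, gold) < f1_threshold:
            pairs.append({"prompt": question, "chosen": gold, "rejected": candidate})
    return pairs


# Invented example text, for illustration only.
pairs = build_preference_pairs(
    question="Is there a pleural effusion?",
    gold="right pleural effusion",
    candidates=["right pleural effusion", "no pneumothorax"],
)
```

Raising the threshold admits near-miss candidates as negatives, while lowering it keeps only clearly wrong ones, so the threshold trades off the number of pairs against how hard the negatives are.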