UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction

Tianmai M. Zhang, Zhaoyi Sun, Sihang Zeng, Chenxi Li, Neil F. Abernethy, Barbara D. Lam, Fei Xia, Meliha Yetisgen

TL;DR

The paper tackles end-to-end extraction of patient chemotherapy timelines from raw clinical notes (subtask 2 of ChemoTimelines) by evaluating a range of LLM-based strategies. It compares prompting baselines, chain-of-thought reasoning, dictionary-enhanced pipelines, supervised fine-tuning (SFT), and direct preference optimization (DPO), all within a two-step workflow of note-level extraction followed by timeline aggregation. The strongest results come from fine-tuned dense models: Qwen3-14B with SFT achieved the top test score (0.678). Dictionary-based methods offer high recall, LLM verification improves precision, and ensemble approaches provide little benefit. Overall, the work highlights the critical role of timeline aggregation, the cost-accuracy trade-offs between prompting and fine-tuning, and avenues for future improvement in clinical timeline extraction.

Abstract

The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 -- generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. The methods differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performance on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts at this task as well as the design of similar tasks.
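
The second step of the workflow described above (normalizing note-level events and aggregating them into patient-level timelines) can be illustrated with a minimal sketch. This is not the authors' algorithm: the tuple layout, the relation labels, the date strings, and the helper names `normalize` and `aggregate` are all hypothetical, chosen only to show the general idea of merging and de-duplicating events extracted from separate notes.

```python
from collections import defaultdict

# Hypothetical note-level outputs: (patient_id, drug mention, relation, date).
# The relation labels and date formats here are illustrative assumptions.
NOTE_EVENTS = [
    ("p1", "Taxol", "begins-on", "2011-03-01"),
    ("p1", "taxol ", "begins-on", "2011-03-01"),  # same event, seen in another note
    ("p1", "Carboplatin", "contains", "2011-03"),
    ("p2", "cisplatin", "ends-on", "2012-07-15"),
]


def normalize(drug, relation, date):
    """Lowercase and trim fields so that duplicate events compare equal."""
    return (drug.strip().lower(), relation.strip().lower(), date.strip())


def aggregate(note_events):
    """Merge note-level events into one de-duplicated timeline per patient."""
    timelines = defaultdict(set)
    for patient_id, drug, relation, date in note_events:
        timelines[patient_id].add(normalize(drug, relation, date))
    # Sort by date string, then drug, then relation, for a stable ordering.
    return {
        pid: sorted(events, key=lambda e: (e[2], e[0], e[1]))
        for pid, events in timelines.items()
    }


timelines = aggregate(NOTE_EVENTS)
for pid, events in timelines.items():
    print(pid, events)
```

In this sketch the two "Taxol" mentions collapse into a single event after normalization. A real system would presumably also need to resolve relative time expressions and map drug-name variants to canonical forms, which this example does not attempt.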

Paper Structure

This paper contains 32 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: DPO reward accuracy curve of Qwen3-14B.