Cancer-Answer: Empowering Cancer Care with Advanced Large Language Models
Aniket Deroy, Subhankar Maity
TL;DR
The paper addresses the challenge of delivering accurate, timely information about gastrointestinal (GI) tract cancers by prompting a large language model, specifically GPT-3.5 Turbo, in a zero-shot question-answering (QA) framework. It defines a GI cancer-focused QA task, uses a small dataset (30 training and 50 testing examples) to evaluate prompt-based QA with two metrics (A1: entity overlap; A2: linguistic quality), and demonstrates improvements over three runs. The methodology centers on prompt engineering, using three distinct prompts to generate clinically relevant responses, and highlights benefits such as minimal data requirements and rapid deployment for clinical decision support. The findings suggest that prompt-based LLM QA systems can augment diagnostic decision-making and patient education, while acknowledging limitations and the need for further domain-specific refinement and validation.
Abstract
Gastrointestinal (GI) tract cancers account for a substantial portion of the global cancer burden, and early diagnosis is critical for improved management and patient outcomes. The complex aetiologies and overlapping symptoms across GI cancers often delay diagnosis, leading to suboptimal treatment strategies. Answering cancer-related queries accurately is crucial for timely diagnosis, treatment, and patient education, as access to accurate, comprehensive information can significantly influence outcomes. However, the complexity of cancer as a disease, combined with the vast amount of available data, makes it difficult for clinicians and patients to quickly find precise answers. To address these challenges, we leverage large language models (LLMs) such as GPT-3.5 Turbo to generate accurate, contextually relevant responses to cancer-related queries. Pre-trained with medical data, these models provide timely, actionable insights that support informed decision-making in cancer diagnosis and care, ultimately improving patient outcomes. We calculate two metrics: A1, the fraction of entities present in the model-generated answer relative to the gold standard, and A2, the linguistic correctness and meaningfulness of the model-generated answer with respect to the gold standard. The maximum values achieved are 0.546 for A1 and 0.881 for A2.
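To make the A1 metric concrete, the following is a minimal sketch of an entity-overlap score. It assumes that A1 is computed as the share of gold-standard entities that also appear in the model-generated answer, and that entities have already been extracted as strings; the abstract does not specify the extraction step or the exact normalization. The function name `entity_overlap` and the example entities are hypothetical, chosen only for illustration.

```python
def entity_overlap(generated_entities, gold_entities):
    """Return the fraction of gold entities found among the generated entities.

    Assumption: matching is exact after lowercasing and trimming whitespace.
    The paper's actual matching rule may differ.
    """
    # Normalize both lists into sets so duplicates and casing do not matter.
    gold = {e.strip().lower() for e in gold_entities if e.strip()}
    generated = {e.strip().lower() for e in generated_entities if e.strip()}

    # With no gold entities the ratio is undefined; return 0.0 by convention.
    if not gold:
        return 0.0

    return len(gold & generated) / len(gold)


# Hypothetical example: two of the four gold entities are recovered.
gold = ["colonoscopy", "polyp", "biopsy", "adenocarcinoma"]
generated = ["Colonoscopy", "biopsy", "CT scan"]
print(entity_overlap(generated, gold))  # 0.5
```

Under this reading, an A1 of 0.546 would mean that a little over half of the gold-standard entities appear in the generated answers, averaged in whatever way the evaluation specifies.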
