Application of NotebookLM, a Large Language Model with Retrieval-Augmented Generation, for Lung Cancer Staging

Ryota Tozuka, Hisashi Johno, Akitomo Amakawa, Junichi Sato, Mizuki Muto, Shoichiro Seki, Atsushi Komaba, Hiroshi Onishi

TL;DR

The study tackles the reliability of large language models in radiology by evaluating a retrieval-augmented generation (RAG) LLM, NotebookLM, for staging lung cancer using the Japanese TNM guidelines as reliable external knowledge (REK). Across 100 fictional CT-based cases, NotebookLM with REK achieved 86% diagnostic accuracy, outperforming GPT-4o with REK (39%) and without REK (25%), and demonstrated 95% accuracy in identifying the correct REK reference locations. The results suggest that RAG-LLMs can reduce hallucinations and provide verifiable, source-backed outputs, enhancing trust in image-based diagnoses. However, the study relies on fictional data and English-language guidelines, highlighting the need for offline deployment, broader validation, and assessment across diverse LLMs before clinical adoption.

Abstract

Purpose: In radiology, large language models (LLMs), including ChatGPT, have recently gained attention, and their utility is being rapidly evaluated. However, concerns have emerged regarding their reliability in clinical applications due to limitations such as hallucinations and insufficient referencing. To address these issues, we focus on the latest technology, retrieval-augmented generation (RAG), which enables LLMs to reference reliable external knowledge (REK). Specifically, this study examines the utility and reliability of a recently released RAG-equipped LLM (RAG-LLM), NotebookLM, for staging lung cancer.

Materials and methods: We summarized the current lung cancer staging guideline in Japan and provided this as REK to NotebookLM. We then tasked NotebookLM with staging 100 fictional lung cancer cases based on CT findings and evaluated its accuracy. For comparison, we performed the same task using a gold-standard LLM, GPT-4 Omni (GPT-4o), both with and without the REK.

Results: NotebookLM achieved 86% diagnostic accuracy in the lung cancer staging experiment, outperforming GPT-4o, which recorded 39% accuracy with the REK and 25% without it. Moreover, NotebookLM demonstrated 95% accuracy in searching reference locations within the REK.

Conclusion: NotebookLM successfully performed lung cancer staging by utilizing the REK, demonstrating superior performance compared to GPT-4o. Additionally, it provided highly accurate reference locations within the REK, allowing radiologists to efficiently evaluate the reliability of NotebookLM's responses and detect possible hallucinations. Overall, this study highlights the potential of NotebookLM, a RAG-LLM, in image diagnosis.

Paper Structure

This paper contains 6 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: An overview of the experimental process. Radiologists from our team created CT findings for 100 fictional lung cancer patients, and each patient's TNM classification was diagnosed by the different LLM settings (NotebookLM with REK, GPT-4o with REK, and GPT-4o without REK). Our team's radiologists evaluated these diagnoses and calculated their diagnostic accuracies. For NotebookLM, since it searches and explicitly presents reference locations within REK as the basis for its answers, we also assessed the appropriateness of these locations (search accuracy). REK=reliable external knowledge.
  • Figure 2: Diagnostic accuracies of TNM classifications for each LLM setting (i.e., NotebookLM with REK, GPT-4o with REK, or GPT-4o without REK) in the experiment using 100 fictional lung cancer cases. A diagnosis of TNM classification was considered correct if all the T, N, and M factors were correctly diagnosed. For NotebookLM with REK, search accuracy was also calculated as the percentage of lung cancer cases in which NotebookLM referenced the appropriate locations within the REK. REK=reliable external knowledge.
  • Figure 3: A representative result from the lung cancer staging experiment using LLMs. The sources 1 to 4 referenced in the answer by NotebookLM with REK are available in Online Resource 3. LLM=large language model, REK=reliable external knowledge.
  • Figure 4: Diagnostic accuracies of the LLMs (NotebookLM with REK, GPT-4o with REK, and GPT-4o without REK) for each of the T, N, and M factors in the experiment with 100 fictional lung cancer patients. REK=reliable external knowledge.
  • Figure 5: An experimental result where NotebookLM made an incorrect numerical comparison. The sources 1 to 4 referenced in the answer are available in Online Resource 3 (which happened to be the same as the case in Figure 3). REK=reliable external knowledge.
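
The captions for Figures 2 and 4 describe two ways of scoring: a case counts as correct overall only if the T, N, and M factors are all correct, while per-factor accuracy scores each factor separately. The following is a minimal sketch of that scoring rule. The `TNM` class, the function names, and the four toy cases are hypothetical and invented for illustration; they are not the study's data or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TNM:
    """One staging result. Field values are plain strings such as 'T1a'."""
    t: str
    n: str
    m: str


def tnm_accuracy(predictions, references):
    """Fraction of cases in which T, N, and M all match the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    # Dataclass equality compares all three fields at once.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def factor_accuracy(predictions, references, factor):
    """Fraction of cases in which a single factor ('t', 'n', or 'm') matches."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(
        getattr(p, factor) == getattr(r, factor)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Toy data, made up for this example only.
references = [
    TNM("T1a", "N0", "M0"),
    TNM("T2b", "N1", "M0"),
    TNM("T3", "N2", "M1b"),
    TNM("T4", "N3", "M1c"),
]
predictions = [
    TNM("T1a", "N0", "M0"),   # all three factors correct
    TNM("T2a", "N1", "M0"),   # T wrong
    TNM("T3", "N2", "M1b"),   # all three factors correct
    TNM("T4", "N2", "M1c"),   # N wrong
]

overall = tnm_accuracy(predictions, references)
per_factor = {f: factor_accuracy(predictions, references, f) for f in ("t", "n", "m")}

print(f"overall: {overall:.2f}")
for name, value in per_factor.items():
    print(f"{name.upper()}: {value:.2f}")
```

In this toy set the overall accuracy (0.50) is lower than any single per-factor accuracy, because one wrong factor is enough to make the whole TNM classification count as incorrect.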