A Multi-Layered Large Language Model Framework for Disease Prediction

Malak Mohamed, Rokaia Emad, Ali Hamdi

TL;DR

The paper addresses disease prediction from noisy Arabic social-health text by proposing a multi-layered LLAMA3-based preprocessing framework that refines, summarizes, and applies NER to user posts before fine-tuning Arabic language models. By evaluating CAMeL-BERT, AraBERT, and Asafaya-BERT with LoRA, the study demonstrates that NER-enhanced preprocessing combined with fine-tuning yields the best performance, achieving up to 83% disease-type accuracy and 69% severity accuracy. Refinement helps disease-type classification, summarization offers limited gains, and non-fine-tuned models perform poorly. The approach presents a practical route to improving telehealth diagnostics and symptom assessment from real-world Arabic health data, with potential applicability to other languages and domains.

Abstract

Social telehealth has revolutionized healthcare by enabling patients to share symptoms and receive medical consultations remotely. Users frequently post symptoms on social media and online health platforms, generating a vast repository of medical data that can be leveraged for disease classification and symptom severity assessment. Large language models (LLMs), such as LLAMA3, GPT-3.5 Turbo, and BERT, process complex medical data to enhance disease classification. This study explores three Arabic medical text preprocessing techniques: text summarization, text refinement, and Named Entity Recognition (NER). In an evaluation of CAMeL-BERT, AraBERT, and Asafaya-BERT fine-tuned with LoRA, the best performance was achieved by CAMeL-BERT with NER-augmented text (83% type classification, 69% severity assessment). Non-fine-tuned models performed poorly (13%-20% type classification, 40%-49% severity assessment). Integrating LLMs into social telehealth systems enhances diagnostic accuracy and treatment outcomes.

Paper Structure

This paper contains 17 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Proposed Multi-Layered Framework for Enhancing Arabic Language Model Fine-Tuning with LLAMA3 Preprocessing
  • Figure 2: Distribution of condition types in the dataset, illustrating the diversity of medical issues represented.
  • Figure 3: Distribution of severity levels in the dataset, showing the balance between mild and severe cases.