A Multi-Layered Large Language Model Framework for Disease Prediction
Malak Mohamed, Rokaia Emad, Ali Hamdi
TL;DR
The paper addresses disease prediction from noisy Arabic social-health text by proposing a multi-layered LLAMA3-based preprocessing framework that refines, summarizes, and applies Named Entity Recognition (NER) to user posts before Arabic language models are fine-tuned on them. Evaluating CAMeL-BERT, AraBERT, and Asafaya-BERT with LoRA, the study finds that NER-enhanced preprocessing combined with fine-tuning yields the best performance, reaching up to 83% disease-type accuracy and 69% severity accuracy. Refinement helps disease-type classification, summarization offers limited gains, and models without fine-tuning perform poorly. The approach offers a practical route to improving telehealth diagnostics and symptom assessment from real-world Arabic health data, with potential applicability to other languages and domains.
Abstract
Social telehealth has revolutionized healthcare by enabling patients to share symptoms and receive medical consultations remotely. Users frequently post symptoms on social media and online health platforms, generating a vast repository of medical data that can be leveraged for disease classification and symptom severity assessment. Large language models (LLMs), such as LLAMA3, GPT-3.5 Turbo, and BERT, can process complex medical text to support disease classification. This study explores three preprocessing techniques for Arabic medical text: text summarization, text refinement, and Named Entity Recognition (NER). We evaluate CAMeL-BERT, AraBERT, and Asafaya-BERT fine-tuned with LoRA; the best performance was achieved by CAMeL-BERT on NER-augmented text (83% accuracy for type classification, 69% for severity assessment). Non-fine-tuned models performed poorly (13%-20% for type classification, 40%-49% for severity assessment). These results suggest that integrating LLMs into social telehealth systems can improve diagnostic accuracy and, in turn, treatment outcomes.
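To make the layered preprocessing concrete, the following is a minimal Python sketch of how refinement, summarization, and NER might be chained before a post reaches a classifier. It is an illustration under stated assumptions, not the authors' implementation: the `LLMFn` type, the prompt prefixes, the `preprocess` function, the separator used to append entities, and the choice to run summarization and NER on the refined text are all hypothetical. The `fake_llm` stub stands in for a real LLM call (such as LLAMA3) so that the sketch runs offline, and it uses an English example only for readability.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that takes a prompt and returns the LLM's reply.
LLMFn = Callable[[str], str]


@dataclass
class PreprocessedPost:
    original: str
    refined: str
    summary: str
    entities: List[str]

    def ner_augmented_text(self) -> str:
        """Refined text with extracted entities appended (assumed format)."""
        if not self.entities:
            return self.refined
        return f"{self.refined} | entities: {'; '.join(self.entities)}"


def preprocess(post: str, llm: LLMFn) -> PreprocessedPost:
    """Run the three layers in sequence; the prompt wording is illustrative only."""
    refined = llm(f"REFINE: {post}").strip()
    summary = llm(f"SUMMARIZE: {refined}").strip()
    raw_entities = llm(f"NER: {refined}")
    entities = [e.strip() for e in raw_entities.split(",") if e.strip()]
    return PreprocessedPost(post, refined, summary, entities)


SYMPTOMS = {"fever", "cough", "headache"}  # toy vocabulary for the stub


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real LLM: rule-based replies per task prefix."""
    task, _, text = prompt.partition(": ")
    if task == "REFINE":
        return " ".join(text.split())  # collapse irregular whitespace
    if task == "SUMMARIZE":
        return text.split(".")[0]  # keep the first sentence
    if task == "NER":
        words = (w.strip(".,").lower() for w in text.split())
        return ", ".join(w for w in words if w in SYMPTOMS)
    return text


if __name__ == "__main__":
    result = preprocess("I have   a fever and  cough. Started two days ago.", fake_llm)
    print(result.ner_augmented_text())
```

In a setup like this, each of `refined`, `summary`, and `ner_augmented_text()` would serve as an alternative input to the fine-tuned classifier, which mirrors the comparison of preprocessing variants described in the abstract.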
