Advancements in eHealth Data Analytics through Natural Language Processing and Deep Learning
Elena-Simona Apostol, Ciprian-Octavian Truică
TL;DR
The chapter surveys NLP and deep learning methods for leveraging eHealth data, addressing both textual and image modalities. It details a pipeline of preprocessing, vector-space representations, and deployment, comparing domain-adapted embeddings (e.g., BioBERT/ClinicalBERT) with traditional models for tasks like NER and relation extraction. It also covers deep learning architectures for medical images, emphasizing CNNs, LSTMs, and Boltzmann-machine variants, and discusses evaluation, validation, and expert-driven feedback essential for clinical applicability. The discussion highlights challenges from data privacy and heterogeneity, and outlines directions for future research to enhance diagnostic support and disease prevention through more effective knowledge extraction from multimodal eHealth data.
Abstract
The healthcare environment is commonly described as "information-rich" yet "knowledge-poor". Healthcare systems collect huge amounts of data from various sources: lab reports, medical letters, logs of medical tools or programs, medical prescriptions, etc. These massive data sets can yield knowledge that improves medical services and the healthcare domain overall, for example disease prediction by analyzing a patient's symptoms, or disease prevention by facilitating the discovery of behavioral factors for diseases. Unfortunately, only a relatively small volume of textual eHealth data is processed and interpreted, an important factor being the difficulty of performing Big Data operations efficiently. In the medical field, detecting domain-specific multi-word terms is a crucial task, as they can define an entire concept in a few words. A term can be defined as a linguistic structure or a concept composed of one or more words with a meaning specific to a domain. All the terms of a domain form its terminology. This chapter offers a critical study of the current best-performing solutions for analyzing unstructured (image and textual) eHealth data. It also compares current Natural Language Processing and Deep Learning techniques in the eHealth context. Finally, we examine and discuss some of the current issues and define a set of research directions in this area.
