Decoding MIE: A Novel Dataset Approach Using Topic Extraction and Affiliation Parsing

Ehsan Bitaraf, Maryam Jafarpour

TL;DR

This work introduces a novel JSON dataset derived from Medical Informatics Europe conference proceedings (1996–2024), enriched with topic extraction via TextRank and affiliation parsing, enabling longitudinal bibliometric analyses in medical informatics. Data are harvested through the Triple-A workflow from PubMed and include 4,606 articles, with a reproducible five-step update protocol hosted on GitHub. The study analyzes DOI consistency, citation patterns, authorship and affiliation anomalies, and language diversity to highlight data quality and historical publishing practices. By providing a structured, richly annotated resource, the dataset supports trend analysis, collaboration-network mapping, curriculum development, and policy-relevant insights for the medical informatics community.

Abstract

The rapid expansion of medical informatics literature presents significant challenges in synthesizing and analyzing research trends. This study introduces a novel dataset derived from the Medical Informatics Europe (MIE) Conference proceedings, addressing the need for sophisticated analytical tools in the field. Using the Triple-A software, we extracted and processed metadata and abstracts from 4,606 articles published in the "Studies in Health Technology and Informatics" journal series, focusing on MIE conferences from 1996 onwards. Our methodology incorporated topic extraction using the TextRank algorithm together with affiliation parsing. The resulting dataset, available in JSON format, offers a comprehensive view of bibliometric details, extracted topics, and standardized affiliation information. Analysis of these data revealed patterns in Digital Object Identifier (DOI) usage, citation trends, and authorship attribution across the years. Notably, we observed inconsistencies in author data and a brief period of linguistic diversity in publications. This dataset represents a significant contribution to the medical informatics community, enabling longitudinal studies of research trends, collaboration network analyses, and in-depth bibliometric investigations. By providing this enriched, structured resource spanning nearly three decades of conference proceedings, we aim to facilitate novel insights and advancements in the rapidly evolving field of medical informatics.

Paper Structure (29 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Articles with and without DOI by Year
  • Figure 2: Articles with No Citations vs At Least One Citation by Year
  • Figure 3: Articles with No Authors vs At Least One Author by Year
  • Figure 4: Articles with at least one incomplete affiliation parsing
  • Figure 5: English vs Non-English Articles by Year
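
As a rough illustration of how a per-year tally such as the one behind Figure 1 could be derived from a JSON dataset of this kind, the sketch below counts records with and without a DOI. The field names "year" and "doi" and the sample records are assumptions made for illustration only; they are not taken from the dataset's actual schema, which should be checked in the project's repository before adapting this.

```python
import json
from collections import Counter

# Hypothetical sample records. The keys "year" and "doi" are assumed names,
# not the documented schema of the MIE dataset.
SAMPLE_JSON = """
[
  {"title": "A", "year": 1996, "doi": null},
  {"title": "B", "year": 1996, "doi": ""},
  {"title": "C", "year": 2024, "doi": "10.0000/example-1"},
  {"title": "D", "year": 2024, "doi": null}
]
"""


def doi_counts_by_year(records):
    """Return {year: {"with_doi": n, "without_doi": m}} for a list of dicts."""
    with_doi = Counter()
    without_doi = Counter()
    for rec in records:
        year = rec.get("year")
        # A missing, null, or empty DOI is counted as "without DOI".
        if rec.get("doi"):
            with_doi[year] += 1
        else:
            without_doi[year] += 1
    years = sorted(set(with_doi) | set(without_doi))
    return {
        y: {"with_doi": with_doi[y], "without_doi": without_doi[y]}
        for y in years
    }


records = json.loads(SAMPLE_JSON)
counts = doi_counts_by_year(records)
for year, tally in counts.items():
    print(year, tally)
```

The same grouping pattern would apply to the other figures (citations, authors, language) by swapping the tested field, again assuming the corresponding keys exist in the real records. Records lacking a year would need separate handling before sorting.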