Adaptive and Multi-Source Entity Matching for Name Standardization of Astronomical Observation Facilities

Liza Fretel, Baptiste Cecconi, Laura Debisschop

TL;DR

This work tackles inconsistent naming of astronomical observation facilities by building a multi-source mapping pipeline that aligns facilities across vocabularies and produces a single standardized label per entity. It combines external identifier linking, surface and semantic similarity scoring, and iterative LLM-based validation, and ranks candidate mappings with a weighted global score $score(p)$. An Elasticsearch-based name resolver API and SKOS/SSSOM outputs enable integration with IVOA vocabularies and OntoPortal-Astro, supporting interoperable data discovery across major astronomy data ecosystems. The approach shows promising validation results on a curated pair set, and the authors outline plans to scale to more vocabularies, improve LLM-based validation, and introduce human-in-the-loop information retrieval for uncertain cases.

Abstract

This ongoing work focuses on the development of a methodology for generating a multi-source mapping of astronomical observation facilities. To compare two entities, we compute scores with adaptable criteria and Natural Language Processing (NLP) techniques (Bag-of-Words approaches, sequential approaches, and surface approaches) to map entities extracted from eight semantic artifacts, including Wikidata and astronomy-oriented resources. We utilize every property available, such as labels, definitions, descriptions, external identifiers, and more domain-specific properties, such as the observation wavebands, spacecraft launch dates, funding agencies, etc. Finally, we use a Large Language Model (LLM) to accept or reject a mapping suggestion and provide a justification, ensuring the plausibility and FAIRness of the validated synonym pairs. The resulting mapping is composed of multi-source synonym sets providing only one standardized label per entity. Those mappings will be used to feed our Name Resolver API and will be integrated into the International Virtual Observatory Alliance (IVOA) Vocabularies and the OntoPortal-Astro platform.
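To make the scoring idea concrete, the following is a minimal sketch of a weighted global score that combines a surface similarity on labels with Bag-of-Words overlaps on descriptions and wavebands. It is not the authors' implementation: the criterion names, the weights in `WEIGHTS`, the entity dictionary layout, and the choice to renormalize over the criteria available for both entities are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical criteria and weights (assumptions, not taken from the paper).
WEIGHTS = {"label_surface": 0.5, "description_bow": 0.3, "waveband_overlap": 0.2}


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bow_similarity(a: str, b: str) -> float:
    """Bag-of-Words similarity: Jaccard overlap of lowercase tokens."""
    return jaccard(set(a.lower().split()), set(b.lower().split()))


def global_score(e1: dict, e2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean over the criteria that both entities can support.

    Criteria with a missing property on either side are skipped and the
    remaining weights are renormalized (an assumed reading of "adaptable").
    """
    criteria = {}
    if e1.get("label") and e2.get("label"):
        criteria["label_surface"] = surface_similarity(e1["label"], e2["label"])
    if e1.get("description") and e2.get("description"):
        criteria["description_bow"] = bow_similarity(
            e1["description"], e2["description"]
        )
    if e1.get("wavebands") and e2.get("wavebands"):
        criteria["waveband_overlap"] = jaccard(
            set(e1["wavebands"]), set(e2["wavebands"])
        )
    total_weight = sum(weights[name] for name in criteria)
    if total_weight == 0:
        return 0.0
    return sum(weights[name] * value for name, value in criteria.items()) / total_weight


if __name__ == "__main__":
    a = {"label": "Hubble Space Telescope", "wavebands": ["optical", "uv"]}
    b = {"label": "Hubble Space Telescope (HST)", "wavebands": ["optical", "uv", "ir"]}
    c = {"label": "Very Large Array", "wavebands": ["radio"]}
    # Rank candidate matches for `a` by descending global score.
    for candidate in sorted([b, c], key=lambda e: global_score(a, e), reverse=True):
        print(candidate["label"], round(global_score(a, candidate), 3))
```

In a pipeline like the one described, the top-ranked candidates would then be passed to the LLM validation step, which accepts or rejects each suggestion with a justification.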

Paper Structure

This paper contains 26 sections, 4 equations, 1 figure.

Figures (1)

  • Figure 1: Data processing pipeline. During data updating (left), described in the data-updating section, we collect observation facilities' records and save them in Turtle files. Those files are the inputs to the entity alignment steps (right), described in the mapping, scoring, and LLM-validation sections. The alignment outputs a linked ontology containing all entities from each list with their matching relations (skos:exactMatch), along with an associated SSSOM ontology. The application views can be generated from the linked ontology; their purposes are explained in the applications subsection.
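As an illustration of the two outputs named in the caption, the sketch below serializes validated pairs as skos:exactMatch triples in Turtle and as rows of an SSSOM-style TSV table. It is an assumption-laden example rather than the authors' code: the `example.org` identifiers, the function names, the confidence value, and the use of `semapv:LexicalMatching` as the justification are placeholders, and a complete SSSOM file would also carry a metadata header (for example a CURIE map) that is omitted here.

```python
import csv
import io

SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def to_turtle(pairs):
    """Serialize (subject_uri, object_uri) pairs as skos:exactMatch triples."""
    lines = [SKOS_PREFIX, ""]
    for subject, obj in pairs:
        lines.append(f"<{subject}> skos:exactMatch <{obj}> .")
    return "\n".join(lines) + "\n"


def to_sssom_tsv(mappings):
    """Write (subject, object, confidence) tuples as an SSSOM-style TSV body."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["subject_id", "predicate_id", "object_id", "mapping_justification", "confidence"]
    )
    for subject, obj, confidence in mappings:
        # The justification value is a placeholder chosen for this sketch.
        writer.writerow(
            [subject, "skos:exactMatch", obj, "semapv:LexicalMatching", confidence]
        )
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical identifiers for the same facility in two source lists.
    s = "http://example.org/listA/hst"
    o = "http://example.org/listB/hubble-space-telescope"
    print(to_turtle([(s, o)]))
    print(to_sssom_tsv([(s, o, 0.93)]))
```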