The Software Observatory: aggregating and analysing software metadata for trend computation and FAIR assessment

Eva Martín del Pico, Josep Lluís Gelpí, Salvador Capella-Gutiérrez

TL;DR

The paper addresses the challenge of fragmented and inconsistent software metadata in Life Sciences by introducing the Software Observatory, a scalable platform that aggregates metadata from diverse registries, enriches and normalises it, and provides automated FAIR assessments through the FAIRsoft Evaluator. Its modular pipeline performs ingestion, EDAM/SPDX harmonisation, external enrichment, and a multi-stage disambiguation process (conservative grouping, conflict detection, rescue heuristics, and LLM-assisted resolution) to produce a deduplicated metadata corpus. The authors demonstrate the approach with a Proteomics case study, revealing strong Findability and licensing gaps, and they discuss how the FAIRsoft Evaluator supports improvement workflows while enabling actionable insights for developers, curators, and policy-makers. The work highlights the Observatory’s potential to guide better software metadata practices, while outlining limitations and future directions such as author disambiguation, document-based metadata mining, and improved visualization of indicator weights, with practical implications for sustainability and FAIR adherence in research software.

Abstract

In the ever-changing realm of research software development, it is crucial for the scientific community to grasp current trends to identify gaps that can potentially hinder scientific progress. The adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) principles can serve as a proxy to understand those trends and provide a mechanism to propose specific actions. The Software Observatory at OpenEBench (https://openebench.bsc.es/observatory) is a novel web portal that consolidates software metadata from various sources, offering comprehensive insights into critical research software aspects. Our platform enables users to analyse trends, identify patterns and advancements within the Life Sciences research software ecosystem, and understand its evolution over time. It also evaluates research software according to FAIR principles for research software, providing scores for different indicators. Users have the ability to visualise this metadata at different levels of granularity, ranging from the entire software landscape to specific communities to individual software entries through the FAIRsoft Evaluator. Indeed, the FAIRsoft Evaluator component streamlines the assessment process, helping developers efficiently evaluate and obtain guidance to improve their software's FAIRness. The Software Observatory represents a valuable resource for researchers and software developers, as well as stakeholders, promoting better software development practices and adherence to FAIR principles for research software.

Paper Structure

This paper contains 18 sections, 5 figures.

Figures (5)

  • Figure 1: Software Observatory metadata processing pipeline. Metadata is ingested from external sources into a raw collection, enriched and normalized through automated methods, and integrated into a deduplicated final dataset. Internal enrichment includes SPDX license mapping, EDAM format normalization, and contributor classification, while auxiliary metadata (e.g., publication data from Europe PMC and Semantic Scholar, and service availability) is retrieved via decoupled pipelines. The integration step groups records into blocks, identifies potential conflicts within each block, and resolves them through a combination of heuristic rules, LLM-based assessments, and human validation. The block structure used for conflict resolution is stored in a dedicated, persistent state file, which is updated after each resolution step and acts as the source of truth for the merged collection. Once conflicts are resolved, the records within each block are merged into unified entries, resulting in an independent, deduplicated collection that constitutes the final output of the pipeline. This layered architecture separates the ingestion, enrichment, and integration stages, so each component can evolve independently. Intermediate layers store enriched-but-unmerged entries to support traceable, incremental updates without re-importing the full dataset. Manual disambiguation and logic improvements (e.g., parsing heuristics) can be applied at the enrichment stage, enhancing flexibility, reproducibility, and adaptation to evolving metadata standards. Because the block file is persistent, it keeps conflict resolution consistent with the merged collection and allows grouping decisions to be transparently corrected, re-evaluated, and reproduced downstream.
  • Figure 2: Functional components of the Software Observatory user interface. The platform supports exploration of the full software collection, specific communities or projects, and individual entries. Visual dashboards display trends, data aggregation details, and FAIRsoft scores. Individual software entries can be evaluated through the FAIRsoft Evaluator in three steps: (1) metadata review and completion, (2) FAIR assessment, and (3) export of metadata in formats such as .CFF (citation file) and maSMP (a structured JSON-LD profile compatible with CodeMeta, Bioschemas, and schema.org). The tool supports both local downloads and direct GitHub integration via pull requests, facilitating metadata improvement and reuse.
  • Figure 3: Sources of metadata integrated into the Software Observatory. The Observatory aggregates software metadata from a range of initial metadata sources (green), including registries such as bio.tools, Bioconda, the Galaxy ToolShed, Bioconductor, SourceForge, Galaxy Europe, and linked GitHub repositories. These records are subsequently enriched using external enrichment sources (light blue), including Semantic Scholar and Europe PMC for publication metadata, and direct service availability checks for deployable tools. These enrichment processes rely on identifiers (e.g., DOIs, service URLs, and software type) already present in the collected software metadata. This figure shows the contribution of each external source to the final integrated dataset. Each square corresponds to approximately 700 software metadata records. The structure of the figure mirrors the layered, dependency-aware integration architecture outlined in Figure 1.
  • Figure 4: Disambiguation pipeline and conflict resolution outcomes. Top: metadata entries are grouped and flagged as potential conflicts, which are resolved using a hybrid approach of LLM-based agreement proxies and human curation. Bottom: distribution of entries, showing proportions resolved automatically, escalated, or discarded.
  • Figure 5: Interactive visualization of (meta)data completeness and software types for the Proteomics community in the Software Observatory.
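
The integration step described in the Figure 1 caption (grouping records into blocks, flagging conflicts, keeping a persistent block state, then merging) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the record fields (`id`, `source`, `name`, `repository`, `license`), the name-based grouping rule, the "more than one repository URL" conflict test, the status labels, and the first-non-empty-value merge policy. The paper's rescue heuristics, LLM-assisted resolution, and human validation are not reproduced.

```python
import json
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": "biotools:foo", "source": "bio.tools", "name": "Foo-Tool",
     "repository": "https://example.org/foo", "license": None},
    {"id": "bioconda:foo", "source": "Bioconda", "name": "footool",
     "repository": "https://example.org/foo", "license": "MIT"},
    {"id": "biotools:bar", "source": "bio.tools", "name": "Bar",
     "repository": "https://example.org/bar-a", "license": None},
    {"id": "toolshed:bar", "source": "Galaxy ToolShed", "name": "bar",
     "repository": "https://example.org/bar-b", "license": None},
]


def normalise_name(name):
    """Lowercase a tool name and drop non-alphanumeric characters."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def group_into_blocks(records):
    """Assumed grouping rule: records share a block if normalised names match."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[normalise_name(rec["name"])].append(rec)
    return dict(blocks)


def has_conflict(block):
    """Assumed conflict test: the block points to more than one repository URL."""
    repos = {rec["repository"] for rec in block if rec.get("repository")}
    return len(repos) > 1


def build_state(blocks):
    """Build the block state that would be persisted as the source of truth."""
    return {
        key: {
            "record_ids": [rec["id"] for rec in block],
            "status": "conflict" if has_conflict(block) else "resolved",
        }
        for key, block in blocks.items()
    }


def merge_block(block):
    """Merge a resolved block; the first non-empty value wins for each field."""
    merged = {}
    for rec in block:
        for field, value in rec.items():
            if field in ("id", "source"):
                continue
            if value and field not in merged:
                merged[field] = value
    merged["sources"] = [rec["source"] for rec in block]
    return merged


blocks = group_into_blocks(records)
state = build_state(blocks)

# Serialised form of the state; a real pipeline would write this to a file
# and update it after each resolution step.
state_json = json.dumps(state, indent=2, sort_keys=True)

# Only blocks marked as resolved are merged into the deduplicated collection.
deduplicated = [
    merge_block(blocks[key])
    for key, entry in state.items()
    if entry["status"] == "resolved"
]
```

In this sketch the two "foo" records merge into one entry, while the "bar" block stays flagged as a conflict and is held back from the merged collection until it is resolved. Deriving the merged output only from the stored state reflects the caption's point that the block file, rather than the merge code, is the source of truth.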