Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs

Alexander Sternfeld, Andrei Kucharavy, Dimitri Percia David, Alain Mermoud, Julian Jang-Jaccard, Nathan Monnet

TL;DR

This work tackles the challenge of forecasting transformative ICT technologies by building a scalable, data-driven pipeline that uses LLMs to extract semantic entity triples from full-text sources and to construct a dynamic knowledge graph of technology concepts. It introduces noun stapling and graph-based convergence metrics to detect emerging patterns of technology convergence, and validates the approach on 278,625 arXiv preprints and 9,793 USPTO patent applications, yielding over 53k key terms and 23.8 million triples. The results reveal both established and emerging convergences, such as retrieval-augmented generation and conversational agents, and demonstrate the method's generalizability across scientific and patent data with implications for proactive technology forecasting and policy planning. The proposed framework provides a scalable, interpretable means to monitor transformative potential in fast-moving ICT domains, with practical significance for researchers, industry, and decision-makers.

Abstract

Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence. Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurrence. We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017-2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018-2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.
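
As a rough illustration of the graph-construction idea in the abstract, the sketch below aggregates (subject, relation, object) triples into entity-pair edges with per-year counts. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Triple` container, the function `build_cooccurrence_graph`, and the sample triples are all hypothetical, and the LLM extraction, multi-stage filtering, and noun stapling stages are not reproduced.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container for one extracted triple and its source year."""
    subject: str
    relation: str
    obj: str
    year: int


def build_cooccurrence_graph(triples):
    """Count, per year, how often each unordered entity pair is linked by a triple."""
    graph = defaultdict(lambda: defaultdict(int))
    for t in triples:
        if t.subject == t.obj:
            continue  # ignore self-loops
        # Sort the pair so (a, b) and (b, a) map to the same undirected edge.
        edge = tuple(sorted((t.subject, t.obj)))
        graph[edge][t.year] += 1
    return {edge: dict(years) for edge, years in graph.items()}


# Invented sample data, for illustration only.
sample = [
    Triple("retrieval", "augments", "generation", 2023),
    Triple("generation", "uses", "retrieval", 2024),
    Triple("conversational agent", "uses", "llm", 2023),
]
graph = build_cooccurrence_graph(sample)
print(graph[("generation", "retrieval")])
```

Keeping the counts per year, rather than a single total, is what would allow a later temporal analysis of how often two entities are linked over time.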

Paper Structure

This paper contains 35 sections, 7 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: The complete pipeline for the triple extraction and downstream analysis, starting from either raw patent data or raw arXiv data. The green components of the pipeline reflect the triple extraction procedure, including the pre- and post-processing steps. The gray components of the pipeline illustrate the key-term extraction, noun stapling, and the downstream analyses.
  • Figure 2: The number of papers for each of the 15 most common topics in the field of LLMs, based on the extracted and grouped key terms. Each of the bars is divided into the arXiv categories from which the papers originate.
  • Figure 3: Line plot showing, for each of the 10 most common topics, the number of papers containing at least one triple for that topic. The gray dashed line shows the aggregate seasonal component, whereas the colored lines show the trend for each topic.
  • Figure 4: The trends for key sub-technology terms for the 10 most common emerging technology topics in the LLM space, based on the extracted and grouped key terms. The gray dashed line shows the aggregate seasonal component that the key terms share, whereas the colored lines show the trend for each key term.
  • Figure 5: Clusters of topics composed using the Louvain method for community detection. Relations in red occur over 70% of the time in 2022 or later, whereas relations in blue occur over 70% of the time in 2021 or earlier. All other relations are in black. Node/edge sizes correspond to absolute frequency, and text size to eigenvector network centrality.
  • ...and 4 more figures
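
The edge coloring rule stated in the Figure 5 caption can be sketched as follows. The 2022 split year and the 70% threshold come from the caption; the function name `edge_color`, its parameters, and the per-year count dictionary are assumptions made for illustration, and the paper's actual plotting code may differ.

```python
def edge_color(year_counts, split_year=2022, threshold=0.7):
    """Classify a relation by when it mostly occurs.

    year_counts maps a year to the number of occurrences of the relation.
    Returns "red" if more than `threshold` of occurrences fall in
    `split_year` or later, "blue" if more than `threshold` fall before
    `split_year`, and "black" otherwise.
    """
    total = sum(year_counts.values())
    if total == 0:
        return "black"  # no occurrences: no temporal signal to color by
    recent = sum(n for year, n in year_counts.items() if year >= split_year)
    early = total - recent
    if recent / total > threshold:
        return "red"
    if early / total > threshold:
        return "blue"
    return "black"


# Invented counts, for illustration only.
print(edge_color({2020: 2, 2023: 8}))
print(edge_color({2019: 9, 2023: 1}))
print(edge_color({2021: 5, 2022: 5}))
```

With these invented counts, the first relation is mostly recent, the second mostly early, and the third evenly split, so each falls into a different color class.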