Data-Driven Information Extraction and Enrichment of Molecular Profiling Data for Cancer Cell Lines

Ellery Smith, Rahel Paloots, Dimitris Giagkos, Michael Baudis, Kurt Stockinger

TL;DR

This work tackles the challenge of enriching structured cancer cell line CNV data with literature-derived evidence by building a data extraction and exploration system. It combines a fine-tuned LILLIE Open Information Extraction pipeline with ontology-based entity linking and a graph database to integrate PubMed-derived triples with Progenetix and Cancercelllines data, enabling interactive visualization of CNV plots annotated by relevant genes and literature. A key contribution is the augmentation with pair extraction and a formal relationship score $R(D,e_1,e_2) = \sum_{P(e_1) \in D}\sum_{P(e_2) \in D}\log_2^{-1}\left(|P(e_1) - P(e_2)| + 1\right)$ to capture long-distance relations, improving the ranking of gene–cell line associations. The approach yields an F1 of 74.2% on an adapted BioRED benchmark for Gene–CellLine pairs and demonstrates practical utility through case studies (Detroit 562 and MDA-MB-453) and large-scale data exploration, providing a public portal for literature-guided molecular profiling analyses in cancer cell lines.
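The relationship score above can be illustrated with a short sketch. It rests on assumptions not stated in this summary: $P(e)$ is taken to be the token offset of a mention of entity $e$ in document $D$, $\log_2^{-1}(x)$ is read as the reciprocal $1/\log_2(x)$, and mention pairs at identical offsets are skipped to avoid division by zero. The function name `relationship_score` is hypothetical.

```python
import math
from typing import Sequence


def relationship_score(positions_e1: Sequence[int],
                       positions_e2: Sequence[int]) -> float:
    """Sum reciprocal-log distances over all mention pairs of two entities.

    Assumption: positions are token offsets within a single document, and
    log_2^{-1}(x) in the formula is interpreted as 1 / log2(x).
    """
    score = 0.0
    for p1 in positions_e1:
        for p2 in positions_e2:
            distance = abs(p1 - p2)
            if distance == 0:
                # Assumption: two distinct entities never share an offset;
                # skip such pairs because 1 / log2(1) is undefined.
                continue
            score += 1.0 / math.log2(distance + 1)
    return score


# Entity 1 is mentioned at offsets 0 and 4, entity 2 at offset 1.
example = relationship_score([0, 4], [1])
```

Under this reading, adjacent mentions contribute 1 to the score and a pair three tokens apart contributes 0.5, so close co-occurrences dominate while long-distance pairs still add a small amount.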

Abstract

With the proliferation of research means and computational methodologies, published biomedical literature is growing exponentially in numbers and volume. Cancer cell lines are frequently used models in biological and medical research that are currently applied for a wide range of purposes, from studies of cellular mechanisms to drug development, which has led to a wealth of related data and publications. Sifting through large quantities of text to gather relevant information on the cell lines of interest is tedious and extremely slow when performed by humans. Hence, novel computational information extraction and correlation mechanisms are required to boost meaningful knowledge extraction. In this work, we present the design, implementation and application of a novel data extraction and exploration system. This system extracts deep semantic relations between textual entities from scientific literature to enrich existing structured clinical data in the domain of cancer cell lines. We introduce a new public data exploration portal, which enables automatic linking of genomic copy number variant plots with ranked, related entities such as affected genes. Each relation is accompanied by literature-derived evidence, allowing for deep, yet rapid, literature search, using existing structured data as a springboard. Our system is publicly available on the web at https://cancercelllines.org

Paper Structure (13 sections, 1 equation, 8 figures, 1 table)

Figures (8)

  • Figure 1: An overview of the architecture of our system, which provides a bridge between unstructured textual corpora and structured clinical data. We first use abstract texts from the Progenetix corpus, along with entity names and synonyms from existing biomedical ontologies such as NCIt and Cellosaurus, to identify textual relational triples using the LILLIE Open Information Extraction system. We then use these triples, along with the relationships from these ontologies, to build a graph database, which is then mapped to existing Copy Number Variant plots from the Progenetix structured database.
  • Figure 2: A sample of the results available for the cell line HOS, including: (1) associated genomic locations mapped on the copy number variation profile plot (gain CNVs in yellow, loss CNVs in blue); (2) evidence for each result; and (3) the relevant abstracts from which the results were derived. The results columns are, from left to right: Gene, Cytoband, or other entity labels; Primary evidence for each abstract (the relevant cell line/entity annotations are marked in bold); Abstract title, and a link to the corresponding PubMed article; Expand/Collapse controls to view detailed information (shown in Figure 5).
  • Figure 3: Graph representation of the relationships in the text "a small-cell lung cancer cell line (NCI-H209) expresses an aberrant underphosphorylated form of the retinoblastoma protein RB1", deriving an EXPRESSES relationship between the cell line NCI-H209 and the gene RB1.
  • Figure 4: Portion of the cell line hierarchy for HeLa, showing the entity itself, and its daughter cell lines. Nodes in the graph are derived from the ontologies (in this case, Cellosaurus), and the edges indicate a 'parent-of' relationship.
  • Figure 5: Section of the results demonstrating a relationship between the cell line HeLa and the gene EGFR, showing the paper title, primary evidence for the result, and, when expanded, the full annotated abstract text.
  • ...and 3 more figures