Data-driven Coreference-based Ontology Building

Shir Ashury-Tahan, Amir David Nissan Cohen, Nadav Cohen, Yoram Louzoun, Yoav Goldberg

TL;DR

This work derives coreference chains from a corpus of 30 million biomedical abstracts and constructs a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain.

Abstract

While coreference resolution is traditionally used as a component in individual document understanding, in this work we take a more global view and explore what we can learn about a domain from the set of all document-level coreference relations that are present in a large corpus. We derive coreference chains from a corpus of 30 million biomedical abstracts and construct a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain. We then use the graph structure and the betweenness centrality measure to distinguish between edges denoting hierarchy, identity, and noise, assign directionality to edges denoting hierarchy, and split nodes (strings) that correspond to multiple distinct concepts. The result is a rich, data-driven ontology over concepts in the biomedical domain, parts of which overlap significantly with human-authored ontologies. We release the coreference chains and resulting ontology under a Creative Commons license, along with the code.
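The graph-construction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the function name `build_cooccurrence_graph`, the lowercasing normalisation, and the toy chains are all assumptions made for the example. Each chain is treated as a list of mention strings, and every pair of distinct phrases in the same chain gets an edge whose weight counts how many chains they share.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(chains):
    """Return {frozenset({a, b}): count} for phrases that share a coreference chain.

    Hypothetical helper: the normalisation (strip + lowercase) is an assumption,
    not something specified in the paper summary.
    """
    edge_counts = Counter()
    for chain in chains:
        # Deduplicate mentions so each phrase contributes one node per chain.
        phrases = sorted({mention.strip().lower() for mention in chain})
        # Connect every pair of distinct phrases found in this chain.
        for a, b in combinations(phrases, 2):
            edge_counts[frozenset((a, b))] += 1
    return edge_counts


# Toy chains invented for illustration; they are not from the released data.
chains = [
    ["breast cancer", "disease"],
    ["diabetes", "disease"],
    ["Breast cancer", "disease", "tumor"],
]

graph = build_cooccurrence_graph(chains)
for edge, count in sorted(graph.items(), key=lambda item: sorted(item[0])):
    print(sorted(edge), count)
```

In this toy input, "breast cancer" and "diabetes" never share a chain, so they are connected only indirectly through "disease".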

Paper Structure

This paper contains 26 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Resulting Ontology Example that may reflect the type of structure achievable using our method.
  • Figure 2: Co-occurrence behavior example demonstrating why the more general the phrase, the more central the phrase. Each phrase can appear with any of the phrases that are more specific than it, making a phrase like "disease" a bridge between communities that is much more central than "breast cancer" in our graph.
  • Figure 3: Example of direction assignment to the edges in the graph. The upper graph shows the connections between phrases that appeared in our corpus, along with their betweenness centrality (BC) values in this graph. The lower graph shows the directed graph obtained from these values.
  • Figure 4: Fixing Edge Direction in cases where a name (e.g., "COVID-19") co-occurs with others in coreference chains more frequently than its general phrase neighbor (e.g., "epidemic"). Our solution (on the right) for correcting the directionality in these cases helps make the paths more accurate. (The gray background represents a concept)
  • Figure 5: Union Nodes to a Concept when a common name (e.g., "gys2") is incorrectly identified as being more hierarchical than its alias neighbors. Our solution (on the right) that is based on semantic similarity representations helps in solving such cases. (The gray background represents a concept)
  • ...and 1 more figure
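The idea behind Figures 2 and 3, that more general phrases are more central and that betweenness centrality (BC) can therefore be used to orient edges, can be illustrated with a small sketch. It is written under stated assumptions: the graph is unweighted and undirected, BC is computed with Brandes' algorithm, each edge is oriented from the higher-BC node to the lower-BC node (taken here to mean general to specific), and ties are left undirected. The function names and the toy graph are hypothetical, and the paper's actual procedure may differ in its details.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm on an unweighted, undirected graph {node: set(neighbours)}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was counted from both endpoints, so halve the totals.
    return {v: score / 2 for v, score in bc.items()}


def orient_edges(adj):
    """Orient each edge from higher BC to lower BC (assumed: general -> specific)."""
    bc = betweenness_centrality(adj)
    directed = {(u, v) for u in adj for v in adj[u] if bc[u] > bc[v]}
    return bc, directed


# Toy graph invented for illustration: a general phrase co-occurs with
# every phrase that is more specific than it.
adj = {
    "disease": {"cancer", "diabetes", "breast cancer", "lung cancer"},
    "cancer": {"disease", "breast cancer", "lung cancer"},
    "diabetes": {"disease"},
    "breast cancer": {"disease", "cancer"},
    "lung cancer": {"disease", "cancer"},
}

bc, directed = orient_edges(adj)
for node, score in sorted(bc.items(), key=lambda item: -item[1]):
    print(f"{node}: {score}")
for parent, child in sorted(directed):
    print(f"{parent} -> {child}")
```

In this toy graph "disease" bridges the otherwise unconnected phrases and receives the highest BC, "cancer" comes second, and the leaf phrases score zero, so every edge ends up pointing from the more general phrase to the more specific one.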