Can we measure the impact of a database?

Peter Buneman; Dennis Dosso; Matteo Lissandrini; Gianmaria Silvello; He Sun

TL;DR

This work extends the $h$-index to hierarchical database structures: impact is measured over antichains, which prevents double-counting, and the index is computed by a polynomial-time algorithm. Applied to DrugBank, GtoPdb, and NCBI Taxonomy, the method shows that the hierarchy-based index can exceed flat, leaf-only counts and that transformations such as lifting can increase the measure further when appropriate. The resulting decomposition offers a principled way to credit curators and contributors, while the paper notes that data citation practice and classification schemes are still evolving. Overall, it establishes a practical, scalable framework for quantifying database impact, with potential implications for data crediting and curation.

Abstract

In disseminating scientific and statistical data, on-line databases have almost completely replaced traditional paper-based media such as journals and reference works. Given this, can we measure the impact of a database in the same way that we measure an author's or journal's impact? To do this, we need somehow to represent a database as a set of publications, and databases typically allow a large number of possible decompositions into parts, any of which could be treated as a publication. We show that the definition of the h-index naturally extends to hierarchies, so that if a database admits some kind of hierarchical interpretation we can use this as one measure of the importance of a database; moreover, this can be computed as efficiently as one can compute the normal h-index. This also gives us a decomposition of the database that might be used for other purposes such as giving credit to the curators or contributors to the database. We illustrate the process by analyzing three widely used databases.
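As a rough illustration of the idea in the abstract, the sketch below computes the ordinary h-index and then a brute-force hierarchical variant on a toy tree. It rests on assumptions that are not spelled out in this summary: that the hierarchical h-index is the best h-index over all antichains (sets of nodes in which none is an ancestor of another), and that an internal node's citation count aggregates those of its children, as the Drugbank figure caption describes. The names `parent`, `cites`, `is_antichain`, and `hierarchy_h_index` are hypothetical, and the exhaustive search is exponential; it is not the paper's top-down algorithm.

```python
from itertools import combinations


def h_index(citations):
    """Classic h-index: largest h such that h items have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:
            h = i
        else:
            break
    return h


def is_antichain(nodes, parent):
    """True if no node in `nodes` is an ancestor of another node in `nodes`."""
    chosen = set(nodes)
    for n in nodes:
        p = parent.get(n)
        while p is not None:
            if p in chosen:
                return False
            p = parent.get(p)
    return True


def hierarchy_h_index(parent, cites):
    """Assumed definition: the best h-index over all antichains (brute force)."""
    nodes = list(cites)
    best = 0
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            if is_antichain(subset, parent):
                best = max(best, h_index([cites[n] for n in subset]))
    return best


# Toy hierarchy (made-up data): child -> parent; the root has no entry.
parent = {"classA": "root", "classB": "root",
          "d1": "classA", "d2": "classA", "d3": "classB"}
# Aggregated citation counts: each internal node sums its children.
cites = {"root": 4, "classA": 2, "classB": 2, "d1": 1, "d2": 1, "d3": 2}

internal = set(parent.values())
leaves = [n for n in cites if n not in internal]
flat = h_index([cites[n] for n in leaves])   # leaf-only h-index
hier = hierarchy_h_index(parent, cites)      # antichain-based h-index
```

On this toy data the leaf-only index is 1, while the antichain {classA, classB} gives 2, which matches the claim that the hierarchy-based index can exceed the flat count.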

Paper Structure (15 sections, 3 theorems, 5 figures, 3 tables, 1 algorithm)

Introduction
The h-index of a hierarchy
Hierarchies and antichains
A top-down algorithm
Algorithm discussion
Experimental Analysis
Drugbank
GtoPdb
NCBI Taxonomy
Conclusions
Proofs
Further details of the databases
Drugbank
GtoPdb
NCBI taxonomy

Key Result

Proposition 1

Let $A$ be an antichain in a ranked hierarchy $(H,\preccurlyeq,r)$ with a rank-minimal node of rank $l$. Then, there exists an $l$-antichain $A' \subseteq H$ such that $|A'| \geq |A|$.

Figures (5)

Figure 1: Partial representation of the Drugbank hierarchical structure. The web pages associated with the nodes are shown on the left. In Drugbank only the leaves of the hierarchy are directly cited; e.g., the citation count of "Lactones" is the aggregate of the citations to the drugs belonging to the class (only Lovastatin and Erythromycin are shown).
Figure 2: A hierarchical citation structure in which the level of each node is determined by its rank -- the number of citations to it. Columns: R -- the level; N -- the length of the maximal $l$-antichain at that level; and H -- the h-index for that antichain.
Figure 3: Part of the IUPHAR/BPS hierarchical structure. The root node represents the whole database, and the family structure is not stratified, as families may have subfamilies. All the nodes in the hierarchy can be cited independently; e.g., we show some sample citation numbers for the GPCR branch of the tree, where the internal nodes can receive direct citations (#cit) that could also be aggregated (#cit agg) with the citations of the child nodes.
Figure 4: Example of one hierarchy (a) and its corresponding "lifted" version (b). Once again, R is the value of the rank at each level, N is the cardinality of the maximum antichain at each level, H is the h-index of that antichain. Below each hierarchy, its h-index is highlighted.
Figure 5: Example of the procedure described in Lemma 1 applied to an example hierarchy.

Theorems & Definitions (7)

Definition 1
Definition 2
Definition 3
Definition 4
Proposition 1
Proposition 2
Proposition 3
