Business Entity Entropy

Adam McCabe, Matthew H. Chequers

TL;DR

Business Entity Entropy introduces an entropy-based framework to quantify how knowledge about an entity is distributed across an organizational document corpus. It defines $p_E(d)$ and $H(E)$ to measure dispersion, reveals a heavy-tailed entropy distribution in which a small set of high-entropy entities drives complexity, and shows that larger entities tend to exhibit higher entropy. The authors propose a hierarchical, diffusion-inspired generative model to describe gradual entropy growth and discuss its applications for prioritizing pre-processing and summarization in retrieval-augmented generation (RAG) pipelines. The work provides practical guidance for entity-centric retrieval design and a theoretical basis for modeling organization-wide knowledge evolution in enterprise memory.

Abstract

Organizations generate vast amounts of interconnected content across various platforms. While language models enable sophisticated reasoning for use in business applications, retrieving and contextualizing information from organizational memory remains challenging. We explore this challenge through the lens of entropy, proposing a measure of entity entropy to quantify the distribution of an entity's knowledge across documents as well as a novel generative model inspired by diffusion models in order to provide an explanation for observed behaviours. Empirical analysis on a large-scale enterprise corpus reveals heavy-tailed entropy distributions, a correlation between entity size and entropy, and category-specific entropy patterns. These findings suggest that not all entities are equally retrievable, motivating the need for entity-centric retrieval or pre-processing strategies for a subset of, but not all, entities. We discuss practical implications and theoretical models to guide the design of more efficient knowledge retrieval systems.
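To make the entity entropy measure concrete, here is a minimal sketch. It assumes that $p_E(d)$ is the fraction of an entity's facts that appear in document $d$ and that $H(E)$ is the Shannon entropy of that distribution; the paper's exact definition may differ. The function name `entity_entropy` and the input format (one document id per extracted fact) are hypothetical.

```python
import math
from collections import Counter


def entity_entropy(fact_doc_ids):
    """Shannon entropy (in bits) of an entity's facts over documents.

    fact_doc_ids: iterable with one document id per fact about the entity.
    Assumption: p_E(d) = (facts about E in d) / (all facts about E).
    """
    counts = Counter(fact_doc_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no facts, so nothing is dispersed
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs)


# Knowledge concentrated in one document versus spread evenly over four.
concentrated = entity_entropy(["doc_a"] * 4)
dispersed = entity_entropy(["doc_a", "doc_b", "doc_c", "doc_d"])
```

Under this reading, an entity whose facts all live in one document has zero entropy, while an entity spread evenly over $n$ documents reaches the maximum of $\log_2 n$ bits, which is why high-entropy entities are the ones that are hard to retrieve from any single document.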

Paper Structure

This paper contains 34 sections, 3 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Distribution of entity entropies across the corpus, showing a right-skewed distribution with a long tail extending into higher entropy regions. As intuition would suggest, the entities in this tail were those central to how the business operates, such as the company itself, its core product offering, the core repository in which code lived, and founding team members. For ease of interpretation, the distribution is visualized (solid green line) using kernel density estimation.
  • Figure 2: Scatter plot showing the relationship between entity size (total facts) and entropy. The dashed line depicts an example linear relationship for comparison; the observed relationship between an entity's size in facts and its entropy is non-linear.
  • Figure 3: Rank-ordered plot of entities by the number of documents needed for 95% coverage, demonstrating the heavy-tailed nature of document requirements across entities. Vertical dashed red line inserted at the 90th percentile entity.
  • Figure 4: Heatmap visualizing the adjacency matrix of the connectivity graph for underlying entities. Two entities are connected if they share a document, and edges are weighted by the number of shared documents. The left plot shows all entities in decreasing order of total associated documents (points exaggerated for readability), while the right plot zooms in on the bottom left corner to show only the top 100 entities. Most points are dark, indicating low connectivity, the exception being the most entropic entities (as expected). Note that an empty cell (white) represents no shared documents.
  • Figure 5: Temporal evolution of entropy for the top 5 most entropic entities in the dataset. The behaviour of these largest entities is very similar, likely due to the high overlap in shared documents (see the section on document overlap).
  • ...and 8 more figures
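
The "documents needed for 95% coverage" quantity in Figure 3 can be illustrated with a short sketch. It assumes coverage means the share of an entity's facts contained in the selected documents, and that documents are picked greedily from the most fact-rich downward; the paper may compute it differently. The function name `docs_for_coverage` and the input format are hypothetical.

```python
from collections import Counter


def docs_for_coverage(fact_doc_ids, coverage=0.95):
    """Smallest number of documents holding at least `coverage` of the facts.

    fact_doc_ids: iterable with one document id per fact about the entity.
    Assumption: documents are taken in decreasing order of fact count.
    """
    counts = Counter(fact_doc_ids)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for n_docs, (_, n_facts) in enumerate(counts.most_common(), start=1):
        covered += n_facts
        if covered >= coverage * total:
            return n_docs
    return len(counts)


# One fact-rich document plus ten single-fact documents (20 facts in total).
facts = ["main"] * 10 + [f"doc_{i}" for i in range(10)]
needed = docs_for_coverage(facts)
```

Taking the largest documents first gives the minimum count for a given coverage level, so an entity with a long tail of single-fact documents needs many documents even when one document holds half of its facts.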