Business Entity Entropy
Adam McCabe, Matthew H. Chequers
TL;DR
Business Entity Entropy introduces an entropy-based framework to quantify how knowledge about an entity is distributed across an organizational document corpus. It defines $p_E(d)$ and $H(E)$ to measure dispersion, reveals a heavy-tailed entropy distribution in which a small set of high-entropy entities drives complexity, and shows that larger entities tend to exhibit higher entropy. The authors propose a hierarchical, diffusion-inspired generative model to describe gradual entropy growth and discuss its use in prioritizing pre-processing and summarization in retrieval-augmented generation (RAG) pipelines. The work provides practical guidance for entity-centric retrieval design and a theoretical basis for modeling how organization-wide knowledge evolves in enterprise memory.
Abstract
Organizations generate vast amounts of interconnected content across various platforms. While language models enable sophisticated reasoning in business applications, retrieving and contextualizing information from organizational memory remains challenging. We explore this challenge through the lens of entropy. We propose a measure of entity entropy that quantifies how an entity's knowledge is distributed across documents, together with a novel generative model, inspired by diffusion models, that offers an explanation for the observed behaviours. Empirical analysis on a large-scale enterprise corpus reveals heavy-tailed entropy distributions, a correlation between entity size and entropy, and category-specific entropy patterns. These findings suggest that not all entities are equally retrievable, which motivates entity-centric retrieval or pre-processing strategies for a subset of entities rather than for all of them. We discuss practical implications and theoretical models to guide the design of more efficient knowledge retrieval systems.
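To make the idea of entity entropy concrete, the following is a minimal sketch, not the authors' implementation. It assumes that $p_E(d)$ is the share of entity $E$'s mentions that fall in document $d$, and that $H(E)$ is the Shannon entropy of that distribution, $H(E) = -\sum_d p_E(d) \log_2 p_E(d)$. The summary above does not spell out the exact definitions, so both are assumptions, and the function name `entity_entropy` and the document identifiers are hypothetical.

```python
import math


def entity_entropy(mention_counts):
    """Shannon entropy (in bits) of one entity's mentions across documents.

    `mention_counts` maps a document id to the number of times the entity
    is mentioned in it. ASSUMPTION: p_E(d) is the normalized mention share;
    the paper may define it differently.
    """
    if any(count < 0 for count in mention_counts.values()):
        raise ValueError("mention counts must be non-negative")
    total = sum(mention_counts.values())
    if total == 0:
        raise ValueError("entity has no mentions")

    entropy = 0.0
    for count in mention_counts.values():
        if count > 0:  # documents with zero mentions contribute nothing
            p = count / total  # assumed p_E(d)
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical entities: one confined to a single document,
# one spread evenly over four documents.
concentrated = {"doc_a": 8}
dispersed = {"doc_a": 2, "doc_b": 2, "doc_c": 2, "doc_d": 2}

print(entity_entropy(concentrated))  # 0.0
print(entity_entropy(dispersed))     # 2.0
```

Under these assumptions, an entity confined to one document has zero entropy, while one spread evenly over $n$ documents reaches the maximum of $\log_2 n$ bits. This matches the intuition that high-entropy entities are harder to retrieve from any single source.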
