A Plaque Test for Redundancies in Relational Data [Extended Version]
Christoph Köhnen, Stefan Klessinger, Jens Zumbrägel, Stefanie Scherzinger
TL;DR
The paper introduces the plaque test, an entropy-based visualization of the redundancies that functional dependencies imply in relational data, grounded in the information-theoretic framework of Arenas and Libkin. It defines the cell-level information content $INF_I(p|F)$, derives a simplified formula that can be computed directly, and proposes two optimizations plus a Monte Carlo estimator with error bounds to scale to larger datasets. The contributions are formal definitions, exact and approximate computation strategies, and experiments on five real-world datasets showing that the visualization gives interpretable cues about dependencies and normalization opportunities. The aim is to let data analysts concentrate on the most informative redundancies during profiling and normalization. The paper also outlines directions for further scalability improvements and for extending the approach to additional dependency types.
Abstract
Inspired by the visualization of dental plaque at the dentist's office, this article proposes a novel visualization of redundancies in relational data. Our approach is based on a well-principled information-theoretic framework that has so far seen limited practical application in systems and tools. In this framework, we quantify the information content (or entropy) of each cell in a relation instance given a set of functional dependencies. The entropy value signifies the likelihood of recovering the cell value based on the dependencies and the remaining tuples. By highlighting cells with lower entropy, we effectively visualize redundancies in the data. We present an initial prototype implementation and demonstrate that a straightforward approach is insufficient to handle practical problem sizes. To address this limitation, we propose several optimizations which we prove to be correct. In addition, we present a Monte Carlo approximation with a known error, enabling a computationally tractable analysis. By applying our visualization technique to real-world datasets, we showcase its potential. Our vision is to empower data analysts by directing their focus in data profiling toward pertinent redundancies, analogous to the diagnostic role of a plaque test at the dentist's office.
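To make the notion of a recoverable cell concrete, the following minimal sketch flags cells whose value is implied by a functional dependency together with another tuple. It is a binary proxy for illustration only, not the entropy measure proposed in this article: it does not compute information content values, and it assumes the instance already satisfies the given dependencies. The function name `recoverable_cells`, the example relation, and the dependency `dept -> city` are hypothetical.

```python
from collections import defaultdict

def recoverable_cells(rows, fds):
    """Return the set of (row_index, attribute) cells whose value is implied
    by some FD lhs -> rhs and another tuple agreeing on lhs.

    Assumption: the instance satisfies every FD in `fds`.
    """
    flagged = set()
    for lhs, rhs in fds:
        # Group row indices by their values on the left-hand side.
        groups = defaultdict(list)
        for i, row in enumerate(rows):
            groups[tuple(row[a] for a in lhs)].append(i)
        # In a group of two or more tuples, every right-hand-side cell
        # can be restored from any other tuple of the same group.
        for members in groups.values():
            if len(members) > 1:
                for i in members:
                    for a in rhs:
                        flagged.add((i, a))
    return flagged

# Hypothetical example relation with the FD dept -> city.
rows = [
    {"emp": "Ann", "dept": "Sales", "city": "Passau"},
    {"emp": "Bob", "dept": "Sales", "city": "Passau"},
    {"emp": "Cy", "dept": "IT", "city": "Munich"},
]
fds = [(("dept",), ("city",))]
flagged = recoverable_cells(rows, fds)
```

Here the two `city` cells of the Sales tuples are flagged, since each could be restored from the other tuple; the `city` cell of the IT tuple is not, because no other tuple shares its `dept` value. An entropy-based score refines this yes/no answer into a graded value per cell.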
