Association via Entropy Reduction

Anthony Gamst, Lawrence Wilson

TL;DR

The paper addresses the problem of identifying associated document pairs and groups by introducing aver, an entropy-based score derived from a simple rank-one model. It contrasts aver with the standard tf-idf approach, showing that aver offers a natural threshold (0) and can outperform tf-idf at low false-positive rates, and that it applies to groups rather than just pairs. Through experiments on the Orkut dataset and a detailed small example, the authors demonstrate how aver captures collaboration structure and can reveal large, densely connected subsets that tf-idf may miss. They also discuss theoretical properties, computation, and practical trade-offs, arguing that grounding the score in entropy gives a more natural interpretation of association, with implications for larger-scale community detection.

Abstract

Prior to recent successes using neural networks, term frequency-inverse document frequency (tf-idf) was widely regarded as the best choice for identifying documents related to a query. We provide a different score, aver, and observe, on a dataset with ground truth marking for association, that aver does better at finding associated pairs than tf-idf. This example involves finding associated vertices in a large graph, which may be an area where neural networks are not currently an obvious best choice. Beyond this one anecdote, we observe that (1) aver has a natural threshold for declaring pairs as unassociated while tf-idf does not, (2) aver can distinguish between pairs of documents for which tf-idf gives a score of 1.0, (3) aver can be applied to larger collections of documents than pairs while tf-idf cannot, and (4) aver is derived from entropy under a simple statistical model while tf-idf is a construction designed to achieve a certain goal, and hence aver may be more "natural." To be fair, we also observe that (1) writing down and computing the aver score for a pair is more complex than for tf-idf and (2) the fact that the aver score is naturally scale-free makes aver scores more complicated to interpret.
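Point (2) of the abstract can be illustrated with a small sketch. The code below assumes a common tf-idf variant (raw term counts weighted by log(N/df), compared by cosine similarity); the paper's exact weighting may differ. The toy documents and the helper names `tfidf_vectors` and `cosine` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    This is one common tf-idf variant, assumed here for illustration.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: each "document" is a list of terms.
docs = {
    "a": ["x"],
    "b": ["x"],
    "c": ["p", "q", "r", "s"],
    "d": ["p", "q", "r", "s"],
    "e": ["y", "z"],
}
vectors = dict(zip(docs, tfidf_vectors(list(docs.values()))))

print(cosine(vectors["a"], vectors["b"]))  # pair sharing one term
print(cosine(vectors["c"], vectors["d"]))  # pair sharing four terms
print(cosine(vectors["a"], vectors["c"]))  # pair sharing nothing
```

Under this variant, the pairs (a, b) and (c, d) both score 1.0 up to floating-point rounding, even though the second pair shares four terms and the first shares only one. This is the kind of tie the abstract says aver can break; the sketch does not attempt to compute aver itself.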

Paper Structure

This paper contains 10 sections, 1 theorem, 13 equations, 2 figures, 4 tables.

Key Result

Theorem 1

Provided that the collaboration accounts for a sufficiently small fraction of the total observed data, that is, if then when we hold all other terms in the formula constant.

Figures (2)

  • Figure 1: As we change the cutoff threshold for tf-idf (blue) and aver (red), we get different false positive and true positive rates for the surviving pairs. We emphasize the results for a threshold of 0 for aver. Given a fixed false positive rate, we would prefer a higher true positive rate.
  • Figure 2: The 10 highest-scoring pairs under tf-idf (left) and aver (right). The five pairs that occur in both top 10s are connected; the other five pairs in each list point to where they would appear in the other list if it extended far enough. The color of each line reflects the number of top 5000 groups containing both of the users in the pair.
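
The threshold sweep described for Figure 1 can be sketched as follows. This is a minimal illustration, assuming a pair "survives" when its score is strictly above the cutoff; the function name `rates_at_cutoff`, the scores, and the labels are hypothetical and are not data from the paper.

```python
def rates_at_cutoff(scores, labels, cutoff):
    """Return (false_positive_rate, true_positive_rate) for pairs above cutoff.

    scores: one score per candidate pair.
    labels: True if the pair is associated according to ground truth.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and not y)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return fpr, tpr


# Hypothetical scores and ground-truth labels, for illustration only.
scores = [0.9, 0.4, 0.1, -0.2, -0.5, 0.3]
labels = [True, True, False, True, False, False]

# Each cutoff yields one (FPR, TPR) point on a curve like those in Figure 1.
for cutoff in (-1.0, 0.0, 0.5):
    fpr, tpr = rates_at_cutoff(scores, labels, cutoff)
    print(f"cutoff={cutoff:+.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising the cutoff moves the point toward the lower left of the plot (fewer false positives, but also fewer true positives). For a score with a natural threshold, such as 0 for aver, one of these points is singled out without any tuning.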

Theorems & Definitions (1)

  • Theorem 1