Probabilistic Reasoning with LLMs for k-anonymity Estimation
Jonathan Zheng, Sauvik Das, Alan Ritter, Wei Xu
TL;DR
BRANCH presents a novel probabilistic reasoning framework for LLMs to estimate privacy risk in user-generated text by predicting the k-anonymity of disclosed attributes. By implicitly constructing a Bayesian network over disclosures and estimating conditional probabilities with LLMs, BRANCH reconstructs the joint distribution to compute $\hat{k} = n \cdot p$, where $n$ is the size of the reference population and $p$ is the joint probability of the disclosed attributes. Empirical evaluation on a human-annotated Reddit/ShareGPT dataset shows BRANCH outperforms Chain-of-Thought baselines, particularly on complex posts with many attributes, and uncertainty signals effectively flag lower-confidence estimates. The work advances both probabilistic reasoning in LLMs and practical privacy risk assessment, offering a foundation for user-facing privacy tools that quantify identification risk in online disclosures.
Abstract
Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the k-privacy value of a text, that is, the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network, and the factors are combined to compute the final k-value. Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high-variance predictions are 37.47% less accurate on average.
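
As a rough illustration of how factor probabilities combine into a k-value, the sketch below multiplies a list of conditional probabilities by a population size. It is a minimal sketch, not the BRANCH implementation: the function name `estimate_k`, the population size, and the probability values are all hypothetical, and in the method described above each factor would come from an LLM estimate rather than a hard-coded number.

```python
from math import prod


def estimate_k(population: float, conditional_probs: list[float]) -> float:
    """Chain-rule estimate of k: population * p(a1) * p(a2 | a1) * ...

    Each entry of `conditional_probs` is assumed to already be conditioned
    on the attributes that precede it in the factorization.
    """
    for p in conditional_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    joint = prod(conditional_probs)  # joint probability of all attributes
    return population * joint


# Hypothetical example: all values below are made up for illustration.
n = 1_000_000                # assumed size of the reference population
factors = [0.5, 0.1, 0.02]   # assumed p(a1), p(a2 | a1), p(a3 | a1, a2)
k_hat = estimate_k(n, factors)
print(f"estimated k: {k_hat:.0f}")
```

With these made-up inputs the joint probability is about 0.001, so roughly 1,000 people in the assumed population would match all three attributes. A smaller k means fewer matching people and therefore a higher identification risk.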
