Story Beyond the Eye: Glyph Positions Break PDF Text Redaction

Maxwell Bland, Anushya Iyer, Kirill Levchenko

TL;DR

The paper shows that PDF redactions can leak information through subpixel glyph position shifts, not just character width, enabling deredaction under several common workflows. It introduces Edact-Ray, a tool suite to locate, analyze, and repair vulnerable redactions, and performs a large-scale, information-theoretic assessment of leakage across corpora and fonts using dictionaries of plausible redacted terms. The study finds that many real-world redactions, especially those produced via Microsoft Word’s dependent glyph-shifting schemes, can reveal nontrivial amounts of information (up to around 15 bits, with high correct-guess probabilities), and that rasterization does not fully mitigate leakage. The authors describe defense strategies, practical recommendations, and responsible disclosure efforts, underscoring the need for robust redaction practices in both software tools and document workflows. Overall, the work establishes a measurable, information-theoretic risk in PDF redactions and offers concrete methods and tools to identify and remediate vulnerable redactions in practice.

Abstract

In this work we find that many current redactions of PDF text are insecure due to non-redacted character positioning information. In particular, subpixel-sized horizontal shifts in redacted and non-redacted characters can be recovered and used to effectively deredact first and last names. Unfortunately, these findings affect redactions where the text underneath the black box is removed from the PDF. We demonstrate these findings by performing a comprehensive vulnerability assessment of common PDF redaction types. We examine 11 popular PDF redaction tools, including Adobe Acrobat, and find that they leak information about redacted text. We also effectively deredact hundreds of real-world PDF redactions, including those found in OIG investigation reports and FOIA responses. To correct the problem, we have released open source algorithms to fix trivial redactions and reduce the amount of information leaked by nonexcising redactions (where the text underneath the redaction is copy-pastable). We have also notified the developers of the studied redaction tools, as well as the Office of Inspector General, the Free Law Project, PACER, Adobe, Microsoft, and the US Department of Justice. We are working with several of these groups to prevent our discoveries from being used for malicious purposes.
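
To make the leakage idea concrete, here is a minimal sketch of the simplest side channel only: the total width left behind by a redacted word. It is not the paper's Edact-Ray code and does not model the glyph-shifting schemes the paper analyzes. The advance widths, the name list, and the helper names (ADVANCE, NAMES, candidates_matching, leaked_bits) are hypothetical, and the bit count assumes a uniform prior over the dictionary.

```python
import math

# Hypothetical advance widths in 1/1000 text space units.
# These values are illustrative and not read from any real font file.
ADVANCE = {
    "J": 389, "S": 556, "T": 611,
    "a": 444, "e": 444, "i": 278, "o": 500, "m": 778, "n": 500,
}

# Hypothetical dictionary of plausible redacted names.
NAMES = ["Jan", "Jen", "Jon", "Sam", "Sim", "Tam", "Tim", "Tom"]


def text_width(word):
    """Total advance width of a word, in 1/1000 text space units."""
    return sum(ADVANCE[ch] for ch in word)


def candidates_matching(observed_width, dictionary, tolerance=0):
    """Keep only the dictionary entries whose width fits the redacted gap."""
    return [w for w in dictionary
            if abs(text_width(w) - observed_width) <= tolerance]


def leaked_bits(dictionary, matches):
    """Entropy reduction, assuming every dictionary entry is equally likely."""
    return math.log2(len(dictionary)) - math.log2(len(matches))


matches = candidates_matching(1333, NAMES)
print(matches)                      # ['Jan', 'Jen']
print(leaked_bits(NAMES, matches))  # 2.0
```

In this toy setting a gap of 1333 units narrows eight candidates to two, which corresponds to 2 bits of leaked information. Any additional positional signal that distinguishes the remaining candidates would raise that figure.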

Paper Structure

This paper contains 53 sections, 1 equation, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: The TJ text-showing operator, which specifies the glyphs to render and, by reference to a font object (not shown), their widths, along with any associated positional adjustments, given in text space units.
  • Figure 2: Snippet of reverse engineered code representing how Microsoft Word leaks redacted character information into non-redacted characters in a PDF document.
  • Figure 3: Nontrivial redaction location algorithm
  • Figure 4: Two-pass algorithm for locating trivial redactions
  • Figure 5: Word WYSIWYG width adjustment method.
  • ...and 3 more figures
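
As a rough illustration of the positional adjustments mentioned in the Figure 1 caption, the sketch below computes horizontal glyph origins from a TJ-style array. It assumes horizontal writing, zero character and word spacing, and 100% horizontal scaling. Under the PDF specification, a number in the array is expressed in thousandths of a text space unit and is subtracted from the current position. The function name glyph_positions, the widths table, and the sample array are hypothetical.

```python
def glyph_positions(tj_array, widths, font_size):
    """Return (glyph, x) pairs for a TJ-style array.

    tj_array  -- list mixing strings (glyphs to show) and numbers (adjustments)
    widths    -- glyph advance widths in 1/1000 text space units (hypothetical)
    font_size -- font size in text space units
    """
    x = 0.0
    positions = []
    for item in tj_array:
        if isinstance(item, (int, float)):
            # A numeric entry moves the next glyph: positive values shift it
            # left, negative values shift it right.
            x -= item / 1000.0 * font_size
        else:
            for ch in item:
                positions.append((ch, x))
                x += widths[ch] / 1000.0 * font_size
    return positions


# Hypothetical widths and a TJ-style array with one small adjustment.
widths = {"A": 500, "B": 500}
for glyph, x in glyph_positions(["A", -20, "B"], widths, font_size=10):
    print(glyph, round(x, 3))
```

Here the adjustment of -20 moves "B" 0.2 units to the right of where its predecessor's advance width alone would place it. Offsets of this kind are the positional information that can remain in a file after the redacted glyphs themselves are removed.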