Why JSON, not PDFs?

Scientific papers today are almost always distributed as PDFs.
While convenient for humans to read, PDFs are a poor fit for machines and AI.
They flatten a paper’s rich structure—sections, equations, references—into a visual layout, stripping away the underlying meaning. For example, equations are reduced to glyphs, with the original LaTeX or semantic intent lost.

For AI and LLMs, this difference can be significant — especially when mathematical equations are involved.


🚫 The PDF Problem

  • Loss of structure — sections, figures, theorems, and references are mashed into a page dump.
  • Broken math — equations are often extracted incorrectly. For example:

Original (LaTeX): x^{y+1} = \frac{a}{b}
Extracted from a PDF, it may look like: x*y+1 = a/b

Superscripts and fractions collapse into plain text, changing the meaning entirely.

  • No semantic cues — citations look like [12] instead of links to the actual reference.
  • Heavy preprocessing required — tools like GROBID try to reconstruct structure, but the process is lossy and compute-intensive.
  • Bad for AI ingestion — LLMs waste context tokens on noise (line breaks, formatting junk, duplicated headers).

Why JSON, not raw LaTeX?

Raw LaTeX looks fine to a compiler, but it's a mess for language models:

  • Macro hell: LLMs don't expand \newcommand or nested macros, so they often misinterpret notation or fail to parse the math at all.
  • No numbering: Section and equation numbers aren't in the source — they're assigned at compile time. Without them, neither you nor the model can anchor on "equation (3.12)" or "Section 4.1."
  • No clear boundaries: Sections, theorems, proofs, and equations are just text streams. Our parser makes them explicit blocks so context is clear.
  • Fragmented sources: LaTeX uses commands like \input, \include, and \externaldocument to stitch together files — we resolve these so you get the complete paper in the right order.
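
To make the macro problem concrete, here is a small illustrative sketch (the macro names `\norm` and `\R` are hypothetical, not taken from any particular paper):

```latex
% Definitions buried in the preamble, possibly in a separate .sty file:
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}

% What the raw source shows a model:
%   \norm{x}^2 \le C \quad \text{for } x \in \R
% What the expanded form (which our JSON carries) shows:
%   \left\lVert x \right\rVert^2 \le C \quad \text{for } x \in \mathbb{R}
```

Without the expansion, a model reading only the equation line has no way to know what `\norm` or `\R` mean.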

Structured JSON resolves all of this: macros are expanded into plain, MathJax-ready LaTeX, real section and equation numbers are materialized, and sections, equations, tables, and environments come through as explicit blocks the model can consume directly.
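
As a rough illustration of what "explicit blocks" means — the field names below are hypothetical, not the exact schema — a section might be represented like this:

```json
{
  "type": "section",
  "number": "4.1",
  "title": "Main Results",
  "blocks": [
    { "type": "equation", "number": "3.12", "latex": "x^{y+1} = \\frac{a}{b}" },
    { "type": "paragraph", "text": "Combining the bounds above gives the claim." }
  ]
}
```

Every block carries its own type and materialized number, so a model (or a tool) can anchor on "equation (3.12)" without compiling anything.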


✅ The JSON Advantage

Our LaTeX-to-JSON format keeps papers in a structured, machine-native representation:

  • Preserves semantics — abstracts, sections, proofs, equations, and references are all labeled and explicit.
  • Math stays math — equations remain as LaTeX:
{ "equation": "x^{y+1} = \\frac{a}{b}" }

LLMs can reason over the actual math, not a corrupted OCR string or a tangle of unexpanded macros.

  • Citations stay connected — references are machine-resolvable, enabling citation graphs and dependency maps.
  • Chunkable for LLMs — papers can be split cleanly into context-aware blocks (abstract, theorem, proof) without losing meaning.
  • Machine readable — the same JSON can drive AI summaries, peer review automation, slide/poster generation, or interactive readers.
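
The chunking idea above can be sketched in a few lines of Python. This is a minimal illustration, not our actual pipeline, and it assumes a simplified schema where each block has a "type" plus a "text" or "latex" field:

```python
def chunk_paper(paper: dict, max_chars: int = 4000) -> list[str]:
    """Split a paper's JSON blocks into context-sized chunks.

    Assumes each block is a dict with "type" and "text"/"latex" keys;
    the real schema may differ -- this is an illustrative sketch.
    Blocks are never split, so semantic units stay intact.
    """
    chunks, current, size = [], [], 0
    for block in paper.get("blocks", []):
        text = block.get("latex") or block.get("text", "")
        piece = f'[{block.get("type", "block")}] {text}'
        # Start a new chunk when adding this block would overflow the budget.
        if size + len(piece) > max_chars and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(piece)
        size += len(piece) + 2
    if current:
        chunks.append("\n\n".join(current))
    return chunks

paper = {
    "blocks": [
        {"type": "abstract", "text": "We study ..."},
        {"type": "equation", "latex": "x^{y+1} = \\frac{a}{b}"},
    ]
}
print(chunk_paper(paper))
```

Because each chunk is built from labeled blocks rather than raw page text, a downstream prompt can say "here is the abstract and equation (3.12)" instead of feeding the model a page dump.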

🚀 Try It Yourself

To download a paper in our structured JSON format:

  1. Navigate to any paper on ScienceStack
  2. Click the "Download" button in the navigation bar (top right)
  3. Select "JSON" format from the dropdown menu

In the JSON, you'll notice how the paper's semantic structure — from equations to citations — is preserved in a clean format you can upload directly to any LLM.