Export Scientific Papers in Structured Formats

Scientific papers are typically distributed as PDFs — convenient for humans, but terrible for machines, AI, and modern workflows.

ScienceStack transforms LaTeX source into three structured export formats that preserve the full semantic content of research papers:

  • Markdown (.md) — Human-readable, works with Obsidian/Notion/VSCode, preserves all numbering
  • JSON (.json) — Machine-native, optimized for LLMs and AI pipelines
  • LaTeX (.tex) — Raw LaTeX with all macros expanded

All formats preserve equations, section numbers, cross-references, and document structure — making them superior to PDF extraction or generic converters.


🚫 The PDF Problem

PDFs flatten rich document structure into visual layouts, stripping away semantic meaning:

  • Loss of structure — Sections, figures, theorems, and references are mashed into a page dump
  • Broken math — Equations are often extracted incorrectly:
    • Original LaTeX: x^{y+1} = \frac{a}{b}
    • Extracted from PDF: x*y+1 = a/b
    • Superscripts and fractions collapse into plain text, changing meaning entirely
  • No semantic cues — Citations appear as [12] instead of links to actual references
  • Heavy preprocessing — Tools like GROBID try to reconstruct structure, but it's lossy and compute-intensive
  • Bad for AI — LLMs waste tokens on noise (line breaks, formatting artifacts, duplicated headers)

✅ Why Structured Exports Matter

Our exports maintain the complete semantic structure of research papers:

  • Preserved numbering — All equations, sections, tables, figures, and theorems keep their original numbers
  • Working cross-references — \ref{thm:main} becomes clickable "Theorem 3.2" links, not broken references
  • Math stays math — Equations remain as LaTeX, not corrupted OCR strings
  • Citations stay connected — References are machine-resolvable for citation graphs
  • Context-aware chunking — Papers can be split into semantic blocks (abstract, theorem, proof) without losing meaning

📥 How to Export

To download a paper in any format:

  1. Navigate to any paper on ScienceStack
  2. Click the "Download" button in the top-right navigation bar
  3. Select your preferred format from the dropdown
  4. Configure options (annotations, assets) and download

Markdown Export

Our Markdown export is purpose-built for research papers and significantly more robust than generic LaTeX→Markdown converters.

What Makes Our Markdown Superior

1. Complete Numbering Preservation

Unlike pandoc and other converters, we preserve all numbering from the original paper:

  • Section numbers — Exactly as in the LaTeX source (e.g., "3.2.1 Main Theorem")
  • Equation numbers — Every numbered equation keeps its label: (3.12)
  • Figure & table numbers — "Figure 4", "Table 2.1" with proper captions
  • Theorem numbers — Lemmas, propositions, corollaries all numbered correctly

Generic converters (like pandoc) typically:

  • Drop section numbers by default
  • Lose equation numbers unless hardcoded
  • Flatten theorem environments into plain text
  • Break on complex LaTeX structures

2. Linkable Cross-References

All \ref{...} commands become live markdown links with their resolved numbers:

% Original LaTeX
See Theorem~\ref{thm:main} and Equation~\eqref{eq:result}
% Our export
See [Theorem 3.2](#theorem-32) and [Equation (4.1)](#eq-41)

This means:

  • Click to jump to referenced content
  • LLMs can accurately answer "explain Equation (3.12)" queries
  • Readers can navigate complex papers efficiently
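The idea behind this resolution can be sketched in a few lines of Python. Note that the LABELS map, anchor scheme, and resolve_refs helper here are illustrative assumptions, not ScienceStack's actual implementation:

```python
import re

# Hypothetical label -> (display text, anchor) map, as a parser might produce it.
LABELS = {
    "thm:main": ("Theorem 3.2", "#theorem-32"),
    "eq:result": ("Equation (4.1)", "#eq-41"),
}

def resolve_refs(text: str) -> str:
    """Turn 'Word~\\ref{label}' spans into markdown links with resolved numbers."""
    def repl(match):
        display, anchor = LABELS[match.group(3)]
        return f"[{display}]({anchor})"
    return re.sub(r"(\w+)~\\(ref|eqref)\{([^}]+)\}", repl, text)

print(resolve_refs(r"See Theorem~\ref{thm:main} and Equation~\eqref{eq:result}"))
# -> See [Theorem 3.2](#theorem-32) and [Equation (4.1)](#eq-41)
```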

3. Complete Asset Package (Pro Feature)

Enable "Include assets" to download a self-contained package — everything you need to view the paper locally with all figures and diagrams intact.

What you get:

arxiv_1706.03762.zip
├── arxiv_1706.03762.md          # Main paper
└── assets/
    ├── figure_1.webp             # Optimized images
    ├── figure_2.webp
    ├── diagram_architecture.svg  # Crisp vector diagrams
    └── diagram_attention.svg

Why this is powerful:

  • Optimized formats — Images converted to .webp (smaller, faster), diagrams exported as .svg (scalable, crisp)
  • Relative paths — Markdown automatically references assets/figure_1.webp, so everything just works
  • Zero setup — Unzip and open the .md file in any markdown viewer (Obsidian, VSCode, Typora, etc.) — all assets display immediately
  • Self-contained — The entire paper is portable. No broken links, no missing images, no external dependencies
  • Vault-ready — Drop directly into your Obsidian vault or note-taking system

This is a complete artifact extracted from our parsed LaTeX AST — not a lossy conversion. Every figure and diagram from the original paper is preserved in modern, optimized formats.

4. LLM-Friendly Annotations

When "Include annotations" is checked:

<!-- LLM: annotations JSON at bottom -->

# Paper Title
[... paper content ...]

<!--ANNOTATIONS
[
  {
    "section": "Theorem 3.2",
    "text": "Let X be a compact manifold...",
    "annotation": "This is the key result of the paper"
  }
]
ENDANNOTATIONS-->

Your notes are embedded as structured JSON in HTML comments — invisible to human readers, but easily parsed by LLMs or scripts.
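Because the block sits between fixed ANNOTATIONS/ENDANNOTATIONS delimiters, extracting it from a script is straightforward. A minimal sketch, assuming the delimiter format shown above (extract_annotations is a hypothetical helper name):

```python
import json
import re

def extract_annotations(markdown: str) -> list:
    """Pull the JSON annotation list out of the trailing HTML comment block."""
    match = re.search(r"<!--ANNOTATIONS\n(.*?)\nENDANNOTATIONS-->", markdown, re.DOTALL)
    return json.loads(match.group(1)) if match else []

doc = """# Paper Title
Some paper content...

<!--ANNOTATIONS
[{"section": "Theorem 3.2", "annotation": "This is the key result"}]
ENDANNOTATIONS-->"""

for note in extract_annotations(doc):
    print(note["section"], "->", note["annotation"])
```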

5. Works Everywhere

Our markdown is compatible with:

  • Obsidian — Direct import with working links

  • Notion — Clean rendering with preserved structure

  • VSCode — Full preview support

  • GitHub — Renders perfectly in repos and READMEs

  • Any markdown editor — Standard CommonMark syntax


JSON Export

Our JSON format is machine-native and optimized for AI applications, LLM ingestion, and programmatic analysis.

Why JSON Over PDFs for LLMs?

PDFs are fundamentally visual formats designed for printing, not machine reading. For LLMs and AI applications, this creates serious problems:

| Problem | PDF | Our JSON |
| --- | --- | --- |
| Math extraction | ❌ Corrupted: x*y+1 = a/b | ✅ LaTeX preserved: x^{y+1} = \frac{a}{b} |
| Structure | ❌ Flattened page layout | ✅ Full semantic tree |
| References | ❌ Plain text, broken links | ✅ Machine-resolvable metadata |
| Chunking | ❌ Arbitrary page breaks | ✅ Semantic boundaries |
| Numbering | ❌ OCR errors, often missing | ✅ All elements numbered |
| Context | ❌ No type information | ✅ Explicit tags: "type": "theorem" |
| Tokens | ❌ Repeated headers/footers | ✅ Clean content only |

Bottom line: PDF extraction tools (GROBID, Nougat) try to reverse-engineer structure from visual layout. We provide the original semantic structure directly from LaTeX source.


Why JSON, Not Raw LaTeX?

Raw LaTeX may look clean, but it's problematic for language models:

Problems with Raw LaTeX

1. Macro hell
LLMs don't expand \newcommand or nested macros, so they often misinterpret notation or fail to parse math entirely.

\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\|#1\right\|}

% LLM sees: \norm{x} \in \R
% Doesn't understand this is a norm in real numbers

2. No numbering
Section and equation numbers aren't in the source — they're assigned at compile time. Without them, neither you nor the model can reference "Equation (3.12)" or "Section 4.1" accurately.

3. No clear boundaries
Sections, theorems, proofs, and equations are just text streams. Our parser makes them explicit blocks so context is clear.

4. Missing content order
LaTeX uses \input, \include, and \externaldocument to stitch together files — we resolve these so you get the complete paper in the right order.
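As a rough illustration of what resolving \input involves, here is a naive recursive sketch — not our resolver, and it ignores comments, \include path quirks, and other edge cases:

```python
import re
from pathlib import Path

def inline_inputs(tex_path: Path) -> str:
    """Recursively replace \\input{file} commands with the file's contents."""
    text = tex_path.read_text()

    def repl(match):
        name = match.group(1)
        # LaTeX allows \input{sec} as shorthand for \input{sec.tex}.
        child = tex_path.parent / (name if name.endswith(".tex") else name + ".tex")
        return inline_inputs(child)

    return re.sub(r"\\input\{([^}]+)\}", repl, text)
```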


JSON Structure

Our JSON format provides a hierarchical, semantic representation of papers:

{
  "metadata": {
    "title": "Attention Is All You Need",
    "authors": [...],
    "abstract": "...",
    "arxiv_id": "1706.03762"
  },
  "content": [
    {
      "type": "section",
      "number": "1",
      "title": "Introduction",
      "content": [...]
    },
    {
      "type": "equation",
      "number": "1",
      "latex": "\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V",
      "label": "eq:attention"
    },
    {
      "type": "theorem",
      "number": "3.1",
      "title": "Main Result",
      "content": [...],
      "label": "thm:main"
    }
  ],
  "bibliography": [...]
}

Key Properties

  • Macros expanded — All \newcommand definitions resolved into plain LaTeX
  • Stable IDs — Every block has a unique identifier for referencing
  • Numbered elements — All equations, sections, theorems have their final numbers
  • Semantic types — Explicit tags for abstracts, proofs, definitions, lemmas, etc.
  • Resolved references — \ref{thm:main} links to the actual theorem block

Annotations in JSON

When "Include annotations" is enabled, your notes are added to the export:

{
  "metadata": {...},
  "content": [...],
  "annotations": [
    {
      "section": "Theorem 3.2",
      "text": "Let X be a compact manifold...",
      "annotation": "This is the key result"
    }
  ]
}

This makes it easy to:

  • Feed annotated context to LLMs
  • Build personal knowledge graphs
  • Share insights with collaborators
  • Train models on annotated papers
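For instance, a small sketch that flattens the annotations array into plain-text context for an LLM prompt (annotations_to_prompt is a hypothetical helper, not part of our API):

```python
def annotations_to_prompt(export: dict) -> str:
    """Join annotations into a plain-text context block for an LLM prompt."""
    parts = []
    for a in export.get("annotations", []):
        parts.append(f"[{a['section']}] {a['text']}\nNote: {a['annotation']}")
    return "\n\n".join(parts)

export = {
    "metadata": {},
    "content": [],
    "annotations": [
        {
            "section": "Theorem 3.2",
            "text": "Let X be a compact manifold...",
            "annotation": "This is the key result",
        }
    ],
}
print(annotations_to_prompt(export))
```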

LaTeX Export

Download the raw LaTeX source with all macros expanded and content in the correct order.

What You Get

Our LaTeX export provides:

  • Macro expansion — All \newcommand, \def, and custom commands resolved
  • Complete content — All \input and \include files merged in order
  • Clean formatting — Unnecessary whitespace and comments removed
  • Bibliography included — References appended as BibTeX entries

Macro Expansion

Unlike downloading raw source from arXiv (which often has dozens of \newcommand definitions), our export expands all macros:

Original source:

\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\|#1\right\|}

The function $f: \R \to \R$ satisfies $\norm{f(x)} < 1$.

Our export:

The function $f: \mathbb{R} \to \mathbb{R}$ satisfies $\left\|f(x)\right\| < 1$.

This makes the LaTeX:

  • ✅ Easier to read and understand

  • ✅ Portable across different LaTeX setups

  • ✅ Less likely to cause LLM hallucinations
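To see why expansion matters, here is a deliberately naive sketch of the idea. It handles only the zero- and one-argument \newcommand forms from the example above — real macro expansion is far more involved:

```python
import re

def expand_macros(source: str) -> str:
    """Naively expand \\newcommand macros (zero- and one-argument forms only)."""
    zero_arg = dict(re.findall(r"\\newcommand\{(\\\w+)\}\{(.+)\}", source))
    one_arg = dict(re.findall(r"\\newcommand\{(\\\w+)\}\[1\]\{(.+)\}", source))
    # Drop the definition lines themselves.
    body = re.sub(r"\\newcommand\{\\\w+\}(?:\[1\])?\{.+\}\n?", "", source)
    # Substitute one-argument macros, splicing the argument into #1.
    for name, template in one_arg.items():
        body = re.sub(re.escape(name) + r"\{([^}]*)\}",
                      lambda m, t=template: t.replace("#1", m.group(1)), body)
    # Substitute zero-argument macros (not followed by a letter).
    for name, replacement in zero_arg.items():
        body = re.sub(re.escape(name) + r"(?![A-Za-z])",
                      lambda m, r=replacement: r, body)
    return body

src = (
    "\\newcommand{\\R}{\\mathbb{R}}\n"
    "\\newcommand{\\norm}[1]{\\left\\|#1\\right\\|}\n"
    "\n"
    "The function $f: \\R \\to \\R$ satisfies $\\norm{f(x)} < 1$.\n"
)
print(expand_macros(src))
```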

Get Started

Try these workflows on any paper:

  1. Browse ScienceStack
  2. Open a paper
  3. Click Download and select your format
  4. Follow the workflow guides above

Need help? Email support@sciencestack.ai