Export Scientific Papers in Structured Formats

Scientific papers are typically distributed as PDFs — convenient for humans, but terrible for machines, AI, and modern workflows.

ScienceStack transforms LaTeX source into three structured export formats that preserve the full semantic content of research papers:

  • Markdown (.md) — Human-readable, works with Obsidian/Notion/VSCode, preserves all numbering
  • JSON (.json) — Machine-native, optimized for LLMs and AI pipelines
  • LaTeX (.tex) — Raw LaTeX with all macros expanded

All formats preserve equations, section numbers, cross-references, and document structure — making them superior to PDF extraction or generic converters.


🚫 The PDF Problem

PDFs flatten rich document structure into visual layouts, stripping away semantic meaning:

  • Loss of structure — Sections, figures, theorems, and references are mashed into a page dump
  • Broken math — Equations are often extracted incorrectly:
    • Original LaTeX: x^{y+1} = \frac{a}{b}
    • Extracted from PDF: x*y+1 = a/b
    • Superscripts and fractions collapse into plain text, changing meaning entirely
  • No semantic cues — Citations appear as [12] instead of links to actual references
  • Heavy preprocessing — Tools like GROBID try to reconstruct structure, but it's lossy and compute-intensive
  • Bad for AI — LLMs waste tokens on noise (line breaks, formatting artifacts, duplicated headers)

✅ Why Structured Exports Matter

Our exports maintain the complete semantic structure of research papers:

  • Preserved numbering — All equations, sections, tables, figures, and theorems keep their original numbers
  • Working cross-references — \ref{thm:main} becomes clickable "Theorem 3.2" links, not broken references
  • Math stays math — Equations remain as LaTeX, not corrupted OCR strings
  • Citations stay connected — References are machine-resolvable for citation graphs
  • Context-aware chunking — Papers can be split into semantic blocks (abstract, theorem, proof) without losing meaning

📥 How to Export

To download a paper in any format:

  1. Navigate to any paper on ScienceStack
  2. Click the "Download" button in the top-right navigation bar
  3. Select your preferred format from the dropdown
  4. Configure options (annotations, assets) and download

Markdown Export

Our Markdown export is purpose-built for research papers and significantly more robust than generic LaTeX→Markdown converters.

What Makes Our Markdown Superior

1. Complete Numbering Preservation

Unlike pandoc and other converters, we preserve all numbering from the original paper:

  • Section numbers — Exactly as in the LaTeX source (e.g., "3.2.1 Main Theorem")
  • Equation numbers — Every numbered equation keeps its label: (3.12)
  • Figure & table numbers — "Figure 4", "Table 2.1" with proper captions
  • Theorem numbers — Lemmas, propositions, corollaries all numbered correctly

Generic converters (like pandoc) typically:

  • Drop section numbers by default
  • Lose equation numbers unless hardcoded
  • Flatten theorem environments into plain text
  • Break on complex LaTeX structures

2. Linkable Cross-References

All \ref{...} commands become live markdown links with their resolved numbers:

% Original LaTeX
See Theorem~\ref{thm:main} and Equation~\eqref{eq:result}
% Our export
See [Theorem 3.2](#theorem-32) and [Equation (4.1)](#eq-41)

This means:

  • Click to jump to referenced content
  • LLMs can accurately answer "explain Equation (3.12)" queries
  • Readers can navigate complex papers efficiently
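The idea behind this resolution can be sketched in a few lines of Python. Note that the LABELS map, anchor scheme, and resolve_refs helper here are illustrative assumptions, not ScienceStack's actual implementation:

```python
import re

# Hypothetical label -> (display text, anchor) map, as a parser might produce it.
LABELS = {
    "thm:main": ("Theorem 3.2", "#theorem-32"),
    "eq:result": ("Equation (4.1)", "#eq-41"),
}

def resolve_refs(text: str) -> str:
    """Turn 'Word~\\ref{label}' spans into markdown links with resolved numbers."""
    def repl(match):
        display, anchor = LABELS[match.group(3)]
        return f"[{display}]({anchor})"
    return re.sub(r"(\w+)~\\(ref|eqref)\{([^}]+)\}", repl, text)

print(resolve_refs(r"See Theorem~\ref{thm:main} and Equation~\eqref{eq:result}"))
# -> See [Theorem 3.2](#theorem-32) and [Equation (4.1)](#eq-41)
```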

3. Complete Asset Package (Pro Feature)

Enable "Include assets" to download a self-contained package — everything you need to view the paper locally with all figures and diagrams intact.

What you get:

arxiv_1706.03762.zip
├── arxiv_1706.03762.md          # Main paper
└── assets/
    ├── figure_1.webp             # Optimized images
    ├── figure_2.webp
    ├── diagram_architecture.svg  # Crisp vector diagrams
    └── diagram_attention.svg

Why this is powerful:

  • Optimized formats — Images converted to .webp (smaller, faster), diagrams exported as .svg (scalable, crisp)
  • Relative paths — Markdown automatically references assets/figure_1.webp, so everything just works
  • Zero setup — Unzip and open the .md file in any markdown viewer (Obsidian, VSCode, Typora, etc.) — all assets display immediately
  • Self-contained — The entire paper is portable. No broken links, no missing images, no external dependencies
  • Vault-ready — Drop directly into your Obsidian vault or note-taking system

This is a complete artifact extracted from our parsed LaTeX AST — not a lossy conversion. Every figure and diagram from the original paper is preserved in modern, optimized formats.

4. LLM-Friendly Annotations

When "Include annotations" is checked:

<!-- LLM: annotations JSON at bottom -->

# Paper Title
[... paper content ...]

<!--ANNOTATIONS
[
  {
    "section": "Theorem 3.2",
    "text": "Let X be a compact manifold...",
    "annotation": "This is the key result of the paper"
  }
]
ENDANNOTATIONS-->

Your notes are embedded as structured JSON in HTML comments — invisible to human readers, but easily parsed by LLMs or scripts.
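Because the block sits between fixed ANNOTATIONS/ENDANNOTATIONS delimiters, extracting it from a script is straightforward. A minimal sketch, assuming the delimiter format shown above (extract_annotations is a hypothetical helper name):

```python
import json
import re

def extract_annotations(markdown: str) -> list:
    """Pull the JSON annotation list out of the trailing HTML comment block."""
    match = re.search(r"<!--ANNOTATIONS\n(.*?)\nENDANNOTATIONS-->", markdown, re.DOTALL)
    return json.loads(match.group(1)) if match else []

doc = """# Paper Title
Some paper content...

<!--ANNOTATIONS
[{"section": "Theorem 3.2", "annotation": "This is the key result"}]
ENDANNOTATIONS-->"""

for note in extract_annotations(doc):
    print(note["section"], "->", note["annotation"])
```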

5. Works Everywhere

Our markdown is compatible with:

  • Obsidian — Direct import with working links

  • Notion — Clean rendering with preserved structure

  • VSCode — Full preview support

  • GitHub — Renders perfectly in repos and READMEs

  • Any markdown editor — Standard CommonMark syntax


JSON Export

Our JSON format is machine-native and optimized for AI applications, LLM ingestion, and programmatic analysis.

Why JSON Over PDFs for LLMs?

PDFs are fundamentally visual formats designed for printing, not machine reading. For LLMs and AI applications, this creates serious problems:

| Problem | PDF | Our JSON |
| --- | --- | --- |
| Math extraction | ❌ Corrupted: x*y+1 = a/b | ✅ LaTeX preserved: x^{y+1} = \frac{a}{b} |
| Structure | ❌ Flattened page layout | ✅ Full semantic tree |
| References | ❌ Plain text, broken links | ✅ Machine-resolvable metadata |
| Chunking | ❌ Arbitrary page breaks | ✅ Semantic boundaries |
| Numbering | ❌ OCR errors, often missing | ✅ All elements numbered |
| Context | ❌ No type information | ✅ Explicit tags: "type": "theorem" |
| Tokens | ❌ Repeated headers/footers | ✅ Clean content only |

Bottom line: PDF extraction tools (GROBID, Nougat) try to reverse-engineer structure from visual layout. We provide the original semantic structure directly from LaTeX source.


Why JSON, Not Raw LaTeX?

Raw LaTeX may look clean, but it's problematic for language models:

Problems with Raw LaTeX

1. Macro hell
LLMs don't expand \newcommand or nested macros, so they often misinterpret notation or fail to parse math entirely.

\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\|#1\right\|}

% LLM sees: \norm{x} \in \R
% Doesn't understand this is a norm in real numbers

2. No numbering
Section and equation numbers aren't in the source — they're assigned at compile time. Without them, neither you nor the model can reference "Equation (3.12)" or "Section 4.1" accurately.

3. No clear boundaries
Sections, theorems, proofs, and equations are just text streams. Our parser makes them explicit blocks so context is clear.

4. Missing content order
LaTeX uses \input, \include, and \externaldocument to stitch together files — we resolve these so you get the complete paper in the right order.
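As a rough illustration of what resolving \input involves, here is a naive recursive sketch — not our resolver, and it ignores comments, \include path quirks, and other edge cases:

```python
import re
from pathlib import Path

def inline_inputs(tex_path: Path) -> str:
    """Recursively replace \\input{file} commands with the file's contents."""
    text = tex_path.read_text()

    def repl(match):
        name = match.group(1)
        # LaTeX allows \input{sec} as shorthand for \input{sec.tex}.
        child = tex_path.parent / (name if name.endswith(".tex") else name + ".tex")
        return inline_inputs(child)

    return re.sub(r"\\input\{([^}]+)\}", repl, text)
```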


JSON Structure

Our JSON format provides a hierarchical, semantic representation of papers:

{
  "metadata": {
    "title": "Attention Is All You Need",
    "authors": [...],
    "abstract": "...",
    "arxiv_id": "1706.03762"
  },
  "content": [
    {
      "type": "section",
      "number": "1",
      "title": "Introduction",
      "content": [...]
    },
    {
      "type": "equation",
      "number": "1",
      "latex": "\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V",
      "label": "eq:attention"
    },
    {
      "type": "theorem",
      "number": "3.1",
      "title": "Main Result",
      "content": [...],
      "label": "thm:main"
    }
  ],
  "bibliography": [...]
}

Key Properties

  • Macros expanded — All \newcommand definitions resolved into plain LaTeX
  • Stable IDs — Every block has a unique identifier for referencing
  • Numbered elements — All equations, sections, theorems have their final numbers
  • Semantic types — Explicit tags for abstracts, proofs, definitions, lemmas, etc.
  • Resolved references — \ref{thm:main} links to the actual theorem block

Annotations in JSON

When "Include annotations" is enabled, your notes are added to the export:

{
  "metadata": {...},
  "content": [...],
  "annotations": [
    {
      "section": "Theorem 3.2",
      "text": "Let X be a compact manifold...",
      "annotation": "This is the key result"
    }
  ]
}

This makes it easy to:

  • Feed annotated context to LLMs
  • Build personal knowledge graphs
  • Share insights with collaborators
  • Train models on annotated papers
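For instance, a small sketch that flattens the annotations array into plain-text context for an LLM prompt (annotations_to_prompt is a hypothetical helper, not part of our API):

```python
def annotations_to_prompt(export: dict) -> str:
    """Join annotations into a plain-text context block for an LLM prompt."""
    parts = []
    for a in export.get("annotations", []):
        parts.append(f"[{a['section']}] {a['text']}\nNote: {a['annotation']}")
    return "\n\n".join(parts)

export = {
    "metadata": {},
    "content": [],
    "annotations": [
        {
            "section": "Theorem 3.2",
            "text": "Let X be a compact manifold...",
            "annotation": "This is the key result",
        }
    ],
}
print(annotations_to_prompt(export))
```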

LaTeX Export

Download the raw LaTeX source with all macros expanded and content in the correct order.

What You Get

Our LaTeX export provides:

  • Macro expansion — All \newcommand, \def, and custom commands resolved
  • Complete content — All \input and \include files merged in order
  • Clean formatting — Unnecessary whitespace and comments removed
  • Bibliography included — References appended as BibTeX entries

Macro Expansion

Unlike downloading raw source from arXiv (which often has dozens of \newcommand definitions), our export expands all macros:

Original source:

\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\|#1\right\|}

The function $f: \R \to \R$ satisfies $\norm{f(x)} < 1$.

Our export:

The function $f: \mathbb{R} \to \mathbb{R}$ satisfies $\left\|f(x)\right\| < 1$.

This makes the LaTeX:

  • ✅ Easier to read and understand

  • ✅ Portable across different LaTeX setups

  • ✅ Less likely to cause LLM hallucinations
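To see why expansion matters, here is a deliberately naive sketch of the idea. It handles only the zero- and one-argument \newcommand forms from the example above — real macro expansion is far more involved:

```python
import re

def expand_macros(source: str) -> str:
    """Naively expand \\newcommand macros (zero- and one-argument forms only)."""
    zero_arg = dict(re.findall(r"\\newcommand\{(\\\w+)\}\{(.+)\}", source))
    one_arg = dict(re.findall(r"\\newcommand\{(\\\w+)\}\[1\]\{(.+)\}", source))
    # Drop the definition lines themselves.
    body = re.sub(r"\\newcommand\{\\\w+\}(?:\[1\])?\{.+\}\n?", "", source)
    # Substitute one-argument macros, splicing the argument into #1.
    for name, template in one_arg.items():
        body = re.sub(re.escape(name) + r"\{([^}]*)\}",
                      lambda m, t=template: t.replace("#1", m.group(1)), body)
    # Substitute zero-argument macros (not followed by a letter).
    for name, replacement in zero_arg.items():
        body = re.sub(re.escape(name) + r"(?![A-Za-z])",
                      lambda m, r=replacement: r, body)
    return body

src = (
    "\\newcommand{\\R}{\\mathbb{R}}\n"
    "\\newcommand{\\norm}[1]{\\left\\|#1\\right\\|}\n"
    "\n"
    "The function $f: \\R \\to \\R$ satisfies $\\norm{f(x)} < 1$.\n"
)
print(expand_macros(src))
```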

Get Started

Try these workflows on any paper:

  1. Browse ScienceStack
  2. Open a paper
  3. Click Download and select your format
  4. Follow the workflow guides above

Need help? Email support@sciencestack.ai