Export Scientific Papers in Structured Formats
Scientific papers are typically distributed as PDFs — convenient for humans, but terrible for machines, AI, and modern workflows.
ScienceStack transforms LaTeX source into three structured export formats that preserve the full semantic content of research papers:
- Markdown (`.md`) — Human-readable, works with Obsidian/Notion/VSCode, preserves all numbering
- JSON (`.json`) — Machine-native, optimized for LLMs and AI pipelines
- LaTeX (`.tex`) — Raw LaTeX with all macros expanded
All formats preserve equations, section numbers, cross-references, and document structure — making them superior to PDF extraction or generic converters.
🚫 The PDF Problem
PDFs flatten rich document structure into visual layouts, stripping away semantic meaning:
- Loss of structure — Sections, figures, theorems, and references are mashed into a page dump
- Broken math — Equations are often extracted incorrectly:
  - Original LaTeX: `x^{y+1} = \frac{a}{b}`
  - Extracted from PDF: `x*y+1 = a/b` ❌
  - Superscripts and fractions collapse into plain text, changing meaning entirely
- No semantic cues — Citations appear as `[12]` instead of links to actual references
- Heavy preprocessing — Tools like GROBID try to reconstruct structure, but it's lossy and compute-intensive
- Bad for AI — LLMs waste tokens on noise (line breaks, formatting artifacts, duplicated headers)
✅ Why Structured Exports Matter
Our exports maintain the complete semantic structure of research papers:
- Preserved numbering — All equations, sections, tables, figures, and theorems keep their original numbers
- Working cross-references — `\ref{thm:main}` becomes a clickable "Theorem 3.2" link, not a broken reference
- Math stays math — Equations remain as LaTeX, not corrupted OCR strings
- Citations stay connected — References are machine-resolvable for citation graphs
- Context-aware chunking — Papers can be split into semantic blocks (abstract, theorem, proof) without losing meaning
📥 How to Export
To download a paper in any format:
- Navigate to any paper on ScienceStack
- Click the "Download" button in the top-right navigation bar
- Select your preferred format from the dropdown
- Configure options (annotations, assets) and download
Markdown Export
Our Markdown export is purpose-built for research papers and significantly more robust than generic LaTeX→Markdown converters.
What Makes Our Markdown Superior
1. Complete Numbering Preservation
Unlike pandoc and other converters, we preserve all numbering from the original paper:
- ✅ Section numbers — Exactly as in the LaTeX source (e.g., "3.2.1 Main Theorem")
- ✅ Equation numbers — Every numbered equation keeps its label: (3.12)
- ✅ Figure & table numbers — "Figure 4", "Table 2.1" with proper captions
- ✅ Theorem numbers — Lemmas, propositions, corollaries all numbered correctly
Generic converters (like pandoc) typically:
- Drop section numbers by default
- Lose equation numbers unless hardcoded
- Flatten theorem environments into plain text
- Break on complex LaTeX structures
2. Linkable Cross-References
All `\ref{...}` commands become live markdown links with their resolved numbers:

```
% Original LaTeX
See Theorem~\ref{thm:main} and Equation~\eqref{eq:result}

% Our export
See [Theorem 3.2](#theorem-32) and [Equation (4.1)](#eq-41)
```
This means:
- Click to jump to referenced content
- LLMs can accurately answer "explain Equation (3.12)" queries
- Readers can navigate complex papers efficiently
3. Complete Asset Package (Pro Feature)
Enable "Include assets" to download a self-contained package — everything you need to view the paper locally with all figures and diagrams intact.
What you get:
```
arxiv_1706.03762.zip
├── arxiv_1706.03762.md          # Main paper
└── assets/
    ├── figure_1.webp            # Optimized images
    ├── figure_2.webp
    ├── diagram_architecture.svg # Crisp vector diagrams
    └── diagram_attention.svg
```
Why this is powerful:
- Optimized formats — Images converted to `.webp` (smaller, faster), diagrams exported as `.svg` (scalable, crisp)
- Relative paths — Markdown automatically references `assets/figure_1.webp`, so everything just works
- Zero setup — Unzip and open the `.md` file in any markdown viewer (Obsidian, VSCode, Typora, etc.) — all assets display immediately
- Self-contained — The entire paper is portable. No broken links, no missing images, no external dependencies
- Vault-ready — Drop directly into your Obsidian vault or note-taking system
This is a complete artifact extracted from our parsed LaTeX AST — not a lossy conversion. Every figure and diagram from the original paper is preserved in modern, optimized formats.
4. LLM-Friendly Annotations
When "Include annotations" is checked:
```
<!-- LLM: annotations JSON at bottom -->
# Paper Title

[... paper content ...]

<!--ANNOTATIONS
[
  {
    "section": "Theorem 3.2",
    "text": "Let X be a compact manifold...",
    "annotation": "This is the key result of the paper"
  }
]
ENDANNOTATIONS-->
```
Your notes are embedded as structured JSON in HTML comments — invisible to human readers, but easily parsed by LLMs or scripts.
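As a sketch of how simple that parsing can be, the snippet below recovers the annotation block with the standard library (the marker format matches the example above; the helper name and sample document are illustrative, not part of ScienceStack):

```python
import json
import re

def extract_annotations(markdown_text: str) -> list:
    """Pull the JSON annotation block out of an exported .md file."""
    match = re.search(r"<!--ANNOTATIONS\n(.*?)\nENDANNOTATIONS-->",
                      markdown_text, re.DOTALL)
    if match is None:
        return []  # export was made without annotations
    return json.loads(match.group(1))

# Minimal sample document in the format shown above
doc = """# Paper Title

[... paper content ...]

<!--ANNOTATIONS
[
  {"section": "Theorem 3.2",
   "text": "Let X be a compact manifold...",
   "annotation": "This is the key result of the paper"}
]
ENDANNOTATIONS-->"""

notes = extract_annotations(doc)
print(notes[0]["section"])  # Theorem 3.2
```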
5. Works Everywhere
Our markdown is compatible with:
- Obsidian — Direct import with working links
- Notion — Clean rendering with preserved structure
- VSCode — Full preview support
- GitHub — Renders perfectly in repos and READMEs
- Any markdown editor — Standard CommonMark syntax
JSON Export
Our JSON format is machine-native and optimized for AI applications, LLM ingestion, and programmatic analysis.
Why JSON Over PDFs for LLMs?
PDFs are fundamentally visual formats designed for printing, not machine reading. For LLMs and AI applications, this creates serious problems:
| Problem | PDF Extraction | Our JSON |
|---|---|---|
| Math extraction | ❌ Corrupted: `x*y+1 = a/b` | ✅ LaTeX preserved: `x^{y+1} = \frac{a}{b}` |
| Structure | ❌ Flattened page layout | ✅ Full semantic tree |
| References | ❌ Plain text, broken links | ✅ Machine-resolvable metadata |
| Chunking | ❌ Arbitrary page breaks | ✅ Semantic boundaries |
| Numbering | ❌ OCR errors, often missing | ✅ All elements numbered |
| Context | ❌ No type information | ✅ Explicit tags: `"type": "theorem"` |
| Tokens | ❌ Repeated headers/footers | ✅ Clean content only |
Bottom line: PDF extraction tools (GROBID, Nougat) try to reverse-engineer structure from visual layout. We provide the original semantic structure directly from LaTeX source.
Why JSON, Not Raw LaTeX?
Raw LaTeX may look clean, but it's problematic for language models:
Problems with Raw LaTeX
1. Macro hell
LLMs don't expand \newcommand or nested macros, so they often misinterpret notation or fail to parse math entirely.
```latex
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\|#1\right\|}

% LLM sees: \norm{x} \in \R
% Doesn't understand this is a norm in real numbers
```
2. No numbering
Section and equation numbers aren't in the source — they're assigned at compile time. Without them, neither you nor the model can reference "Equation (3.12)" or "Section 4.1" accurately.
3. No clear boundaries
Sections, theorems, proofs, and equations are just text streams. Our parser makes them explicit blocks so context is clear.
4. Missing content order
LaTeX uses \input, \include, and \externaldocument to stitch together files — we resolve these so you get the complete paper in the right order.
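The core idea behind resolving `\input` can be sketched as a recursive inline pass. This is a simplified illustration, not ScienceStack's actual parser — real LaTeX has more cases (brace-less `\input`, `\include` page breaks, path search order):

```python
import re
import tempfile
from pathlib import Path

def inline_inputs(tex: str, root: Path) -> str:
    """Recursively splice \\input{name} files into the main document."""
    def splice(match: re.Match) -> str:
        name = match.group(1)
        path = root / (name if name.endswith(".tex") else name + ".tex")
        # Recurse so included files may themselves use \input
        return inline_inputs(path.read_text(), root)
    return re.sub(r"\\input\{([^}]+)\}", splice, tex)

# Demo: a main file that pulls in a section file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "intro.tex").write_text("Introduction text.")
    main = r"\section{Intro} \input{intro} Done."
    flat = inline_inputs(main, root)
print(flat)  # \section{Intro} Introduction text. Done.
```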
JSON Structure
Our JSON format provides a hierarchical, semantic representation of papers:
```json
{
  "metadata": {
    "title": "Attention Is All You Need",
    "authors": [...],
    "abstract": "...",
    "arxiv_id": "1706.03762"
  },
  "content": [
    {
      "type": "section",
      "number": "1",
      "title": "Introduction",
      "content": [...]
    },
    {
      "type": "equation",
      "number": "1",
      "latex": "\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V",
      "label": "eq:attention"
    },
    {
      "type": "theorem",
      "number": "3.1",
      "title": "Main Result",
      "content": [...],
      "label": "thm:main"
    }
  ],
  "bibliography": [...]
}
```
Key Properties
- Macros expanded — All `\newcommand` definitions resolved into plain LaTeX
- Stable IDs — Every block has a unique identifier for referencing
- Numbered elements — All equations, sections, theorems have their final numbers
- Semantic types — Explicit tags for abstracts, proofs, definitions, lemmas, etc.
- Resolved references — `\ref{thm:main}` links to the actual theorem block
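With typed blocks, pulling out every equation (or theorem, proof, etc.) is a straightforward walk over the content tree. A minimal sketch, assuming only the fields shown in the example above:

```python
def collect_blocks(blocks: list, block_type: str) -> list:
    """Recursively gather all blocks of a given semantic type."""
    found = []
    for block in blocks:
        if block.get("type") == block_type:
            found.append(block)
        # Sections nest their children under "content"
        children = block.get("content")
        if isinstance(children, list):
            found.extend(collect_blocks(children, block_type))
    return found

# Toy paper in the structure shown above (contents abbreviated)
paper = {
    "content": [
        {"type": "section", "number": "1", "title": "Introduction",
         "content": [
             {"type": "equation", "number": "1",
              "latex": "E = mc^2", "label": "eq:energy"},
         ]},
        {"type": "theorem", "number": "3.1", "title": "Main Result",
         "content": [], "label": "thm:main"},
    ],
}

equations = collect_blocks(paper["content"], "equation")
print([eq["label"] for eq in equations])  # ['eq:energy']
```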
Annotations in JSON
When "Include annotations" is enabled, your notes are added to the export:
```json
{
  "metadata": {...},
  "content": [...],
  "annotations": [
    {
      "section": "Theorem 3.2",
      "text": "Let X be a compact manifold...",
      "annotation": "This is the key result"
    }
  ]
}
```
This makes it easy to:
- Feed annotated context to LLMs
- Build personal knowledge graphs
- Share insights with collaborators
- Train models on annotated papers
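For the first of these, pairing each annotation with its quoted passage yields ready-made LLM prompt context. A hypothetical helper (not part of any ScienceStack SDK), assuming the annotation fields shown above:

```python
def annotations_to_context(export: dict) -> str:
    """Format annotations as a plain-text context block for an LLM prompt."""
    lines = []
    for note in export.get("annotations", []):
        lines.append(f'[{note["section"]}] "{note["text"]}"')
        lines.append(f'  Note: {note["annotation"]}')
    return "\n".join(lines)

# Abbreviated export in the shape shown above
export = {
    "metadata": {},
    "content": [],
    "annotations": [
        {"section": "Theorem 3.2",
         "text": "Let X be a compact manifold...",
         "annotation": "This is the key result"}
    ],
}
print(annotations_to_context(export))
```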
LaTeX Export
Download the raw LaTeX source with all macros expanded and content in the correct order.
What You Get
Our LaTeX export provides:
- Macro expansion — All `\newcommand`, `\def`, and custom commands resolved
- Complete content — All `\input` and `\include` files merged in order
- Clean formatting — Unnecessary whitespace and comments removed
- Bibliography included — References appended as BibTeX entries
Macro Expansion
Unlike downloading raw source from arXiv (which often has dozens of `\newcommand` definitions), our export expands all macros:
Original source:

```latex
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\|#1\right\|}

The function $f: \R \to \R$ satisfies $\norm{f(x)} < 1$.
```

Our export:

```latex
The function $f: \mathbb{R} \to \mathbb{R}$ satisfies $\left\|f(x)\right\| < 1$.
```
This makes the LaTeX:
- ✅ Easier to read and understand
- ✅ Portable across different LaTeX setups
- ✅ Less likely to trigger LLM hallucinations
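To see why expansion is non-trivial, here is a toy sketch that handles only parameter-free macros like `\R` (helper names are illustrative; macros with arguments such as `\norm{#1}` require real argument parsing, which is why a proper LaTeX parser is needed):

```python
import re

def parse_newcommands(tex: str) -> dict:
    """Collect zero-argument \\newcommand{\\name}{body} definitions,
    counting braces so bodies like \\mathbb{R} are captured whole."""
    defs, i, marker = {}, 0, "\\newcommand{\\"
    while (start := tex.find(marker, i)) != -1:
        name_end = tex.index("}", start + len(marker))
        name = tex[start + len(marker):name_end]
        depth, k = 1, name_end + 2          # skip the body's opening "{"
        while depth:
            depth += {"{": 1, "}": -1}.get(tex[k], 0)
            k += 1
        defs[name] = tex[name_end + 2:k - 1]
        i = k
    return defs

def expand_macros(tex: str) -> str:
    """Drop definition lines, then substitute longest macro names first
    so \\Rx is never clobbered by \\R."""
    defs = parse_newcommands(tex)
    body = "\n".join(line for line in tex.splitlines()
                     if not line.startswith("\\newcommand"))
    for name in sorted(defs, key=len, reverse=True):
        body = re.sub(r"\\" + re.escape(name) + r"\b",
                      lambda m: defs[name], body)
    return body

source = "\\newcommand{\\R}{\\mathbb{R}}\nThe map $f: \\R \\to \\R$."
print(expand_macros(source))  # The map $f: \mathbb{R} \to \mathbb{R}$.
```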
Get Started
Try these workflows on any paper:
- Browse ScienceStack
- Open a paper
- Click Download and select your format
- Follow the workflow guides above
Need help? Email support@sciencestack.ai
