DeepXiv-SDK: An Agentic Data Interface for Scientific Literature

Hongjin Qian, Ziyi Xia, Ze Liu, Jianlyu Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Junwei Lan, Sen Wang, Zhengyang Liang, Yingxia Shao, Defu Lian, Zheng Liu

TL;DR

This work introduces DeepXiv-SDK, which gives agents progressive access to papers, aligned with how they allocate attention and reading budget. It also supports multi-faceted retrieval and aggregation over paper attributes, which allows constraint-driven search and curation over paper sets.

Abstract

LLM agents are increasingly used to accelerate scientific research. Yet a persistent bottleneck is data access: agents not only lack readily available tools for retrieval, but also have to work with unstructured, human-centric data on the Internet, such as HTML web pages and PDF files, leading to excessive token consumption, limited working efficiency, and brittle evidence look-up. This gap motivates the development of an agentic data interface, which is designed to enable agents to access and utilize scientific literature in a more effective, efficient, and cost-aware manner. In this paper, we introduce DeepXiv-SDK, which offers a three-layer agentic data interface for scientific literature. 1) Data Layer, which transforms unstructured, human-centric data into normalized and structured representations in JSON format, improving data usability and enabling progressive accessibility of the data. 2) Service Layer, which presents readily available tools for data access and ad-hoc retrieval. It also supports several modes of agent usage, including CLI, MCP, and Python SDK. 3) Application Layer, which provides a built-in agent that packages basic tools from the service layer to support complex data access demands. DeepXiv-SDK currently supports the complete arXiv corpus and is synchronized daily to incorporate new releases. It is designed to extend to all common open-access corpora, such as PubMed Central, bioRxiv, medRxiv, and ChemRxiv. We release RESTful APIs, an open-source Python SDK, and a web demo showcasing deep search and deep research workflows. DeepXiv-SDK is free to use with registration.
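
To make the idea of progressive access over a structured JSON record concrete, the following is a minimal sketch in Python. It is not the DeepXiv-SDK API or schema: the record layout, the field names (`sections`, `summary`, `paragraphs`), and the helper functions are all hypothetical and chosen only for illustration. The point is that an agent can read a cheap header first, then an outline, and only then pull the full text of one section.

```python
import json

# Hypothetical normalized record. Field names are illustrative assumptions,
# not the actual DeepXiv-SDK schema.
PAPER = {
    "id": "example-0001",
    "title": "An Example Paper",
    "abstract": "We study an example problem.",
    "sections": [
        {
            "name": "Introduction",
            "summary": "Motivation and scope.",
            "paragraphs": ["First paragraph of the introduction.", "Second paragraph."],
        },
        {
            "name": "Method",
            "summary": "How the approach works.",
            "paragraphs": ["The method has two stages.", "Stage one filters; stage two ranks."],
        },
    ],
}


def header_view(paper):
    """Cheapest view: metadata only, enough for triage."""
    return {key: paper[key] for key in ("id", "title", "abstract")}


def outline_view(paper):
    """Mid-cost view: section names and summaries, for navigation."""
    return [{"name": s["name"], "summary": s["summary"]} for s in paper["sections"]]


def section_view(paper, name):
    """Most detailed view: full paragraphs of one section, for evidence checks."""
    for section in paper["sections"]:
        if section["name"] == name:
            return section["paragraphs"]
    raise KeyError(name)


def approx_tokens(obj):
    """Crude size proxy: whitespace-separated words in the JSON dump."""
    return len(json.dumps(obj).split())
```

Under these assumptions, an agent would call `header_view` on many candidates, `outline_view` on the few that look relevant, and `section_view` only where it needs to verify a claim, so the cost of reading grows with how far a paper survives triage.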

Paper Structure (25 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: System overview of DeepXiv-SDK. The system ingests and enriches papers into a normalized schema with budget-aware, progressive access views (header-first triage, section-level navigation, and evidence-level verification), and serves them via a REST API backed by hybrid retrieval (lexical and dense indexes). These capabilities support agentic applications including deep search, deep research, and reproducible, evidence-grounded comparison.
  • Figure 2: Evaluation of DeepXiv-SDK. (a) Agentic paper search on 50 multi-constraint queries with unique targets: DeepXiv achieves higher Recall@1/10 with substantially lower latency than existing agentic search platforms. (b) Deep research QA on 47 queries: DeepXiv reduces token and time cost while improving answer quality compared to a traditional Search&Read pipeline.
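
For readers unfamiliar with the Recall@1/10 metric in Figure 2(a), here is a minimal sketch of how Recall@k is commonly computed when each query has exactly one correct target. The evaluation protocol actually used in the paper is not described in this section, so the function below is an assumption based on the standard definition, and the paper IDs in the example are made up.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose single target appears among the top-k results.

    ranked_lists: one ranked list of result IDs per query.
    targets: the unique correct ID for each query, in the same order.
    """
    if len(ranked_lists) != len(targets):
        raise ValueError("ranked_lists and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Illustrative data only, not results from the paper.
runs = [
    ["p1", "p2", "p3"],  # target ranked first
    ["p9", "p4", "p5"],  # target ranked second
    ["p7", "p8", "p6"],  # target not retrieved
]
targets = ["p1", "p4", "p0"]
```

With one target per query, Recall@1 is the share of queries answered correctly by the top result, and Recall@10 is the share where the target appears anywhere in the first ten results.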