Model-Document Protocol for AI Search

Hongjin Qian, Zheng Liu

TL;DR

The paper addresses the gap between unstructured external data and the needs of LLM-based information seeking by introducing the Model–Document Protocol (MDP), a formal framework that transforms raw documents into compact, task-specific knowledge representations. It defines three pathways (agentic reasoning, memory grounding, and structured leveraging) that produce LLM-ready context. It instantiates the framework with MDP-Agent, which uses gist memories, diffusion-based exploration, memory-guided parallel synthesis, and a map–reduce synthesis process to build minimal yet sufficient knowledge spaces. In experiments on the GAIA and WebWalkerQA benchmarks, MDP-Agent outperforms vanilla RAG and other baselines, validating the protocol and demonstrating scalability and generalizability across LLMs. The work offers a principled, scalable way to give LLMs contextual intelligence by turning noisy, unstructured data into structured, reasoning-ready knowledge representations.

Abstract

AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
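The map-reduce style synthesis mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`map_reduce_synthesize`, `keyword_extract`, `concat_merge`) and the `fan_in` parameter are hypothetical, and the keyword-based extractor and string-joining merger are toy stand-ins for what would presumably be LLM calls in a real system.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_synthesize(
    documents: List[str],
    query: str,
    extract: Callable[[str, str], str],
    merge: Callable[[List[str], str], str],
    fan_in: int = 4,
) -> str:
    """Map: pull query-relevant notes from each document in parallel.
    Reduce: merge notes in groups of `fan_in` until a single context remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each reduce round shrinks the list")
    if not documents:
        return ""
    # Map step: one extraction per document; pool.map preserves input order.
    with ThreadPoolExecutor() as pool:
        notes = list(pool.map(lambda doc: extract(doc, query), documents))
    # Drop documents that contributed nothing relevant.
    notes = [note for note in notes if note]
    # Reduce step: hierarchical merging keeps every merge input small.
    while len(notes) > 1:
        notes = [merge(notes[i:i + fan_in], query) for i in range(0, len(notes), fan_in)]
    return notes[0] if notes else ""


# Toy stand-ins (assumptions for illustration only).
def keyword_extract(doc: str, query: str) -> str:
    """Keep sentences that share at least one word with the query."""
    terms = set(query.lower().split())
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    hits = [s for s in sentences if terms & set(s.lower().split())]
    return ". ".join(hits)


def concat_merge(notes: List[str], query: str) -> str:
    """Join notes, dropping exact duplicates while preserving order."""
    unique: List[str] = []
    for note in notes:
        if note not in unique:
            unique.append(note)
    return " | ".join(unique)


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France. Cats sleep a lot.",
        "Bananas are yellow.",
        "France borders Spain.",
    ]
    print(map_reduce_synthesize(docs, "France capital", keyword_extract, concat_merge))
```

The hierarchical reduce is one possible design choice: it bounds the size of any single merge input, which matters when the merger is a model with a limited context window.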

Paper Structure

This paper contains 29 sections, 15 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The Model–Document Protocol (MDP) provides a standard protocol for bridging unstructured data to LLMs. Raw documents are first processed through basic cleaning and segmentation, followed by semantic abstraction that captures summaries, claims, and topics. These signals are then transformed into structured formats such as graphs or KV caches. Building on this foundation, MDP enables contextual intelligence for LLMs through multiple pathways, including constructing LLM-ready context via agentic information discovery, enhancing reasoning with memory grounding, and directly leveraging pre-encoded structures such as KV caches or graphs.
  • Figure 2: Illustration of a complex information-seeking task, where the answer depends on satisfying multiple conditions through horizontal exploration and vertical exploitation. MDP-Agent addresses this agentically by formulating intents, decomposing them into atomic queries, and expanding coverage via diffusion to gather raw documents. Resolved intents advance iteratively to the next, with documents processed in parallel and synthesized through a map–reduce procedure into subspace knowledge, which is then transformed into an LLM-ready context.
  • Figure 3: Analysis of MDP-Agent from three perspectives: (a) effect of the central reasoning LLM, comparing MDP-Agent with a TIR baseline (Search-o1); (b) transferability of MDP-Agent’s LLM-ready context, compared with RAG and TIR across downstream LLMs for answer generation; and (c) impact of the diffusion-search budget on performance and resulting retrieval dynamics.
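
The captions for Figures 2 and 3 refer to expanding coverage via diffusion under a search budget. The page does not specify the procedure, so the following is only a rough analogy: a budgeted breadth-first query expansion. The names `diffuse_search`, `search`, and `expand` are hypothetical, as is the choice to count the budget in issued queries.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def diffuse_search(
    seed_queries: Iterable[str],
    search: Callable[[str], Dict[str, str]],
    expand: Callable[[str, str], List[str]],
    budget: int,
) -> Dict[str, str]:
    """Spread outward from seed queries, issuing at most `budget` searches.

    `search` maps a query to {doc_id: text}; `expand` proposes follow-up
    queries from a query and a newly found document. Both are assumed
    interfaces, not part of any real library.
    """
    frontier = deque(seed_queries)
    issued = set()
    collected: Dict[str, str] = {}
    while frontier and len(issued) < budget:
        query = frontier.popleft()
        if query in issued:
            continue  # never spend budget on a repeated query
        issued.add(query)
        results = search(query)
        # Only documents not seen before trigger further expansion.
        new_docs = {k: v for k, v in results.items() if k not in collected}
        collected.update(new_docs)
        for text in new_docs.values():
            for follow_up in expand(query, text):
                if follow_up not in issued:
                    frontier.append(follow_up)
    return collected


if __name__ == "__main__":
    # Toy corpus where each document points to the next topic.
    corpus = {
        "a": "alpha mentions beta",
        "b": "beta mentions gamma",
        "c": "gamma is terminal",
    }
    toy_search = lambda q: {k: v for k, v in corpus.items() if v.startswith(q)}
    toy_expand = lambda q, text: [text.split()[-1]]
    print(sorted(diffuse_search(["alpha"], toy_search, toy_expand, budget=2)))
    print(sorted(diffuse_search(["alpha"], toy_search, toy_expand, budget=10)))
```

In this toy setting a small budget stops the expansion before the last document is reached, while a larger budget lets it run until the frontier is exhausted.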