S2Doc -- Spatial-Semantic Document Format
Sebastian Kempf, Frank Puppe
TL;DR
S2Doc tackles the lack of a unified standard for document and table modeling by introducing a modular data format that jointly represents spatial, logical, semantic, and ontological information. It integrates a spatial model (Spaces and Regions), a flexible logical graph (ReferenceGraph) for document structure, a functional layer for role labeling, a semantic layer that grounds table data in headers and values, and an ontological layer via a SemanticKnowledgeGraph linked by a SemanticReferenceGraph. The approach enables multi-page support, task-agnostic adaptation, and reproducible pipelines with explicit uncertainty annotations, aiming to simplify integration and evaluation of document/table understanding methods. The authors also outline a workflow and a roadmap for framework development and format conversion to facilitate adoption across existing pipelines and datasets.
Abstract
Documents are a common way to store and share information, and tables are an important part of many documents. However, there is no common understanding of how to model documents, and tables in particular. Because of this lack of standardization, most scientific approaches model documents and tables in their own way, leading to a variety of data structures and formats that are not directly compatible. Furthermore, most data models focus on either the spatial or the semantic structure of a document and neglect the other. To address this, we developed S2Doc, a flexible data structure for modeling documents and tables that combines spatial and semantic information in a single format. It is designed to be easily extendable to new tasks and supports most modeling approaches for documents and tables, including multi-page documents. To the best of our knowledge, it is the first approach of its kind to combine all these aspects in a single format.
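
To make the layering more concrete, the following is a minimal sketch of how a spatial model (Spaces and Regions), a logical ReferenceGraph, and functional role labels might fit together for a table that spans two pages. It is an illustration under stated assumptions, not the S2Doc schema or API: every class layout, field name, identifier, and the page size are hypothetical, and the semantic and ontological layers are left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Space:
    """A coordinate space, e.g. one page (hypothetical fields)."""
    space_id: str
    width: float
    height: float


@dataclass(frozen=True)
class Region:
    """An axis-aligned box inside one Space (hypothetical simplification)."""
    region_id: str
    space_id: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class ReferenceGraph:
    """Directed parent -> children edges between element ids (assumed shape)."""
    children: dict[str, list[str]] = field(default_factory=dict)

    def add_edge(self, parent: str, child: str) -> None:
        self.children.setdefault(parent, []).append(child)

    def descendants(self, node: str) -> list[str]:
        """Return all ids reachable from `node` in depth-first order."""
        out: list[str] = []
        seen: set[str] = set()
        stack = list(reversed(self.children.get(node, [])))
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            out.append(cur)
            stack.extend(reversed(self.children.get(cur, [])))
        return out


@dataclass
class Document:
    """Bundles the spatial, logical, and functional layers (hypothetical)."""
    spaces: dict[str, Space] = field(default_factory=dict)
    regions: dict[str, Region] = field(default_factory=dict)
    structure: ReferenceGraph = field(default_factory=ReferenceGraph)
    roles: dict[str, str] = field(default_factory=dict)  # element id -> role


def build_example() -> Document:
    """One logical table whose two parts lie on different pages."""
    doc = Document()
    for page in ("page1", "page2"):
        doc.spaces[page] = Space(page, 595.0, 842.0)  # assumed page size
    doc.regions["table_part1"] = Region("table_part1", "page1", 50, 600, 545, 800)
    doc.regions["table_part2"] = Region("table_part2", "page2", 50, 40, 545, 200)
    doc.structure.add_edge("doc", "table")
    doc.structure.add_edge("table", "table_part1")
    doc.structure.add_edge("table", "table_part2")
    doc.roles["table"] = "table"
    return doc


if __name__ == "__main__":
    doc = build_example()
    parts = doc.structure.descendants("table")
    print(parts, {doc.regions[p].space_id for p in parts})
```

In this sketch the logical node "table" carries no geometry of its own. Its spatial extent is recovered by following graph edges to Regions, which is one plausible way a single format could cover multi-page elements while keeping spatial and logical information separate.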
