S2Doc -- Spatial-Semantic Document Format

Sebastian Kempf, Frank Puppe

TL;DR

S2Doc tackles the lack of a unified standard for document and table modeling by introducing a modular data format that jointly represents spatial, logical, semantic, and ontological information. It integrates a spatial model (Spaces and Regions), a flexible logical graph (ReferenceGraph) for document structure, a functional layer for role labeling, a semantic layer that grounds table data in headers and values, and an ontological layer via a SemanticKnowledgeGraph linked by a SemanticReferenceGraph. The approach enables multi-page support, task-agnostic adaptation, and reproducible pipelines with explicit uncertainty annotations, aiming to simplify integration and evaluation of document/table understanding methods. The authors also outline a workflow and a roadmap for framework development and format conversion to facilitate adoption across existing pipelines and datasets.

Abstract

Documents are a common way to store and share information, with tables being an important part of many documents. However, there is no real common understanding of how to model documents and tables in particular. Because of this lack of standardization, most scientific approaches have their own way of modeling documents and tables, leading to a variety of different data structures and formats that are not directly compatible. Furthermore, most data models focus on either the spatial or the semantic structure of a document, neglecting the other aspect. To address this, we developed S2Doc, a flexible data structure for modeling documents and tables that combines both spatial and semantic information in a single format. It is designed to be easily extendable to new tasks and supports most modeling approaches for documents and tables, including multi-page documents. To the best of our knowledge, it is the first approach of its kind to combine all these aspects in a single format.

Paper Structure

This paper contains 17 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Example table from huang2023improvingtablestructurerecognition and its five models. The physical model is a set of bounding boxes that may have content. Because the input medium is assumed to be a PDF, the cell content is known. Depending on the extraction approach, 'Training' and 'Dataset' may be represented as two separate cells or as one cell. The logical model orders these cells into rows and columns, using either a grid/matrix structure or a graph structure. The graph model connects cells in the same row or column, either directly or indirectly through intermediate nodes/objects. Depending on the approach used for structure recognition, there may be intermediate states of logical structure. For example, it is ambiguous which rows 'Methods' and 'WAvg.F1' belong to, because each lies between two rows. They could be assigned to one of the two rows, placed between them, or marked as a spanning cell. The same applies to the 'IoU' cell, which lies between two columns. The functional model identifies row and column label cells based on the results of the previous models. The semantic model combines all previous models to form a set of tuples $\langle [ \textrm{row header(s)} ], [ \textrm{column header(s)} ], [ \textrm{value} ] \rangle$. In this step, resolving the ambiguities of the logical model is crucial, as it directly affects the resulting tuples. If 'IoU' were not identified as belonging to both the 'C3' and 'C5' columns, it would be missing from the tuple that should read $\langle [ \textrm{VAST} ], [ \textrm{IoU},\textrm{0.5} ], [ \textrm{66.8} ] \rangle$. Finally, the ontological model annotates the table with background knowledge. This knowledge is not contained in the table itself, but may be provided by the surrounding document or an external knowledge base. The annotation is achieved by associating elements with concepts and entities. In this example, 'VAST' is associated with the concept 'Model' and 'IoU' with the entity 'IoU'.
  • Figure 2: Representation of a table extraction workflow using S2Doc, showing how each processing stage is modeled across abstraction levels using the example table also shown in Fig. 1. The physical structure is built based on the input PDF document and the output of a table detection system. A table structure recognition module builds the logical structure in one of two ways. The functional structure marks header cells, and the semantic structure can be generated based on the previous results. The ontological structure enables the annotation of elements on every level with knowledge.
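
The semantic tuples described in the Figure 1 caption can be illustrated with a small sketch. This is not the S2Doc implementation: the `Cell` class, the `to_grid` and `semantic_tuples` helpers, and the grid layout are hypothetical names and assumptions made for illustration. Only 'Methods', 'IoU', '0.5', 'VAST', and '66.8' come from the caption; the second column ('0.75', 'n/a') is a made-up placeholder. The sketch shows how treating 'IoU' as a spanning cell changes the resulting tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical logical cell: text plus grid position and spans."""
    text: str
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1


def to_grid(cells):
    """Expand cells into a dense grid; a spanning cell fills every slot it covers."""
    n_rows = max(c.row + c.row_span for c in cells)
    n_cols = max(c.col + c.col_span for c in cells)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.row_span):
            for k in range(c.col, c.col + c.col_span):
                grid[r][k] = c.text
    return grid


def semantic_tuples(cells, header_rows, header_cols):
    """Build (row headers, column headers, value) tuples for every value cell."""
    grid = to_grid(cells)
    tuples = []
    for r in range(header_rows, len(grid)):
        for k in range(header_cols, len(grid[r])):
            if grid[r][k] is None:
                continue
            row_headers = [grid[r][j] for j in range(header_cols)
                           if grid[r][j] is not None]
            col_headers = [grid[i][k] for i in range(header_rows)
                           if grid[i][k] is not None]
            tuples.append((row_headers, col_headers, [grid[r][k]]))
    return tuples


body = [
    Cell("Methods", 0, 0, row_span=2),
    Cell("0.5", 1, 1),
    Cell("0.75", 1, 2),   # placeholder, not taken from the paper
    Cell("VAST", 2, 0),
    Cell("66.8", 2, 1),
    Cell("n/a", 2, 2),    # placeholder, not taken from the paper
]

# 'IoU' resolved as a cell spanning both value columns.
resolved = semantic_tuples(body + [Cell("IoU", 0, 1, col_span=2)], 2, 1)
# 'IoU' assigned to the second value column only (ambiguity left unresolved).
unresolved = semantic_tuples(body + [Cell("IoU", 0, 2)], 2, 1)

print(resolved[0])    # (['VAST'], ['IoU', '0.5'], ['66.8'])
print(unresolved[0])  # (['VAST'], ['0.5'], ['66.8'])
```

In the unresolved case the '66.8' value loses its 'IoU' header, which is the failure mode the caption warns about.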
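
The staged workflow in the Figure 2 caption can be pictured as successive passes that each fill one layer of a shared structure. The sketch below is a loose illustration under assumptions, not the S2Doc API: `Region`, `LayeredDoc`, and all function names are hypothetical, the bounding-box coordinates are arbitrary placeholders, and the logical layer is reduced to plain parent-child edges.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Physical layer: a bounding box on a page, optionally with text."""
    id: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1); values below are arbitrary placeholders
    text: str = ""


@dataclass
class LayeredDoc:
    """Hypothetical container holding one field per abstraction level."""
    regions: dict = field(default_factory=dict)   # physical
    edges: list = field(default_factory=list)     # logical: (parent, child)
    roles: dict = field(default_factory=dict)     # functional: id -> role
    concepts: dict = field(default_factory=dict)  # ontological: id -> label


def add_physical(doc, regions):
    """Stage 1: register detected regions (e.g. from PDF and table detection)."""
    for region in regions:
        doc.regions[region.id] = region


def add_logical(doc, parent_id, child_ids):
    """Stage 2: link a parent region to its children; ids must already exist."""
    for child_id in child_ids:
        for rid in (parent_id, child_id):
            if rid not in doc.regions:
                raise KeyError(f"unknown region: {rid}")
        doc.edges.append((parent_id, child_id))


def add_functional(doc, region_id, role):
    """Stage 3: mark a region's role, such as a header cell."""
    doc.roles[region_id] = role


def add_ontological(doc, region_id, label):
    """Stage 4: attach a concept or entity label to a region."""
    doc.concepts[region_id] = label


def children(doc, parent_id):
    """Return the ids of all regions directly below parent_id."""
    return [c for p, c in doc.edges if p == parent_id]


doc = LayeredDoc()
add_physical(doc, [
    Region("table", 0, (0, 0, 100, 40)),
    Region("c_iou", 0, (40, 0, 100, 10), "IoU"),
    Region("c_vast", 0, (0, 20, 40, 30), "VAST"),
    Region("c_val", 0, (40, 20, 70, 30), "66.8"),
])
add_logical(doc, "table", ["c_iou", "c_vast", "c_val"])
add_functional(doc, "c_iou", "column_header")
add_functional(doc, "c_vast", "row_header")
add_ontological(doc, "c_vast", "concept:Model")
add_ontological(doc, "c_iou", "entity:IoU")

print(children(doc, "table"))
print(doc.roles, doc.concepts)
```

Keeping each level in its own field means a later stage can be rerun or replaced without touching the layers below it, which is one plausible reading of the modularity the summary describes.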