OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas

James Y. Huang, Wenxuan Zhou, Nan Xu, Fei Wang, Qin Liu, Sheng Zhang, Hoifung Poon, Muhao Chen

TL;DR

OmniStruct is a universal benchmark for text-to-structure generation that unifies diverse tasks (NER, RE, text-to-table, function calling) under a JSON-schema framework. The benchmark is assembled from multiple existing datasets converted into a schema-guided format. Experiments show that large models like GPT-4o lead in overall performance, while smaller models can close the gap through synthetic instruction tuning. A three-step data-synthesis pipeline (task filtering, task synthesis, instance generation and validation) distills GPT-4o's capabilities into a compact model, OmniStruct-8B, which achieves strong results on several tasks and highlights the potential for cost-effective universal text-to-structure models. The study also shows that constrained decoding helps only minimally and that schema adherence alone does not guarantee high content quality. It acknowledges the limitation of focusing solely on JSON and leaves other structured formats to future work.

Abstract

The ability of Large Language Models (LLMs) to generate structured outputs that follow arbitrary schemas is crucial to a wide range of downstream tasks that require diverse structured representations of results, such as information extraction, table generation, and function calling. While modern LLMs excel at generating unstructured responses in natural language, whether this advancement translates to strong performance on text-to-structure tasks remains unclear. To bridge this gap, we first introduce OmniStruct, a comprehensive benchmark for assessing LLMs' capabilities on diverse text-to-structure tasks such as information extraction, table generation, and function calling. We build OmniStruct by identifying existing datasets across a wide range of tasks that are suitable for a structured answer format, and adapting them under a unified text-to-structure problem setting. To facilitate the development of efficient text-to-structure models, we collect high-quality training data via synthetic task generation. Without using any supervised data for OmniStruct tasks, our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models that can rival the performance of GPT-4o.

Paper Structure

This paper contains 24 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Overview of our unified problem definition for text-to-structure tasks in OmniStruct. Taking NER as an example, each task instance consists of a task instruction and input that jointly define the task, an answer JSON schema that defines the expected answer structure, and a JSON ground-truth answer that strictly follows the given schema.
  • Figure 2: Spectrum of OmniStruct tasks based on how extractive the task is.
  • Figure 3: Parsing Error Rate of different LLMs. Parsing errors are rare in general and thus have little impact on the model's performance.
  • Figure 4: Categories of synthetic text-to-structure data.
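
To make the instance layout described in Figure 1 concrete, here is a minimal Python sketch of what a single NER instance might look like, together with a schema check and a parsing error rate in the sense of Figure 3. The field names (`instruction`, `input`, `schema`, `answer`), the example text, and both helper functions are illustrative assumptions rather than the benchmark's actual format or evaluation code, and `conforms` covers only a small subset of JSON Schema.

```python
import json

# Hypothetical NER instance in the four-part layout described in Figure 1.
# Field names and example text are assumptions made for illustration only.
instance = {
    "instruction": "Extract all person and organization entities from the input.",
    "input": "Alice Smith joined Acme Corp last year.",
    "schema": {
        "type": "object",
        "properties": {
            "person": {"type": "array", "items": {"type": "string"}},
            "organization": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["person", "organization"],
    },
    "answer": {"person": ["Alice Smith"], "organization": ["Acme Corp"]},
}


def conforms(value, schema):
    """Check a value against a small JSON Schema subset:
    type (object/array/string), properties, required, items."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        if any(key not in value for key in schema.get("required", [])):
            return False
        props = schema.get("properties", {})
        return all(conforms(v, props[k]) for k, v in value.items() if k in props)
    if kind == "array":
        if not isinstance(value, list):
            return False
        return all(conforms(item, schema.get("items", {})) for item in value)
    if kind == "string":
        return isinstance(value, str)
    return True  # other keywords are out of scope for this sketch


def parsing_error_rate(outputs):
    """Fraction of raw model outputs that cannot be parsed as JSON."""
    if not outputs:
        return 0.0
    failures = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(outputs)
```

Note that a prediction can pass `conforms` and still contain the wrong entities, which is consistent with the observation that schema adherence alone does not guarantee content quality.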