OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Michael J. Bommarito

TL;DR

OpenGloss presents a synthetic encyclopedic dictionary and semantic knowledge graph generated by a four-stage, schema-validated pipeline that attains WordNet-scale breadth (537K senses across 150K lexemes) while adding encyclopedic context, etymology, usage examples, collocations, and 9.14M semantic edges. The method uses a modular, multi-agent LLM framework with automated quality assurance to produce the resource in 96 hours for under $1,000. OpenGloss is positioned as complementary to WordNet, BabelNet, and ConceptNet, offering pragmatic, learner-focused lexicography with rich educational content and the potential for dynamic updates. The work highlights the trade-offs of synthetic generation, emphasizes reproducibility, and envisions broad applications in education, NLP, and knowledge-grounded language models, while noting limitations and the legal and ethical considerations around generated content.

Abstract

We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. The resource also includes 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedic context, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.

Paper Structure

This paper contains 55 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: OpenGloss generation pipeline. The four-stage process combines multi-agent LLM generation with deterministic graph construction and systematic enrichment. Pydantic schema validation ensures type safety at each stage. The entire pipeline completed in 96 hours at under $1,000 using gpt-5-nano, with automated QA using Claude Sonnet 4.5.
  • Figure 2: OpenGloss data model hierarchy. The Pydantic schema organizes information at three levels: Lexeme (root container with etymology and encyclopedia), Part of Speech (POS-specific with 1-4 senses and morphology), and Lexical Sense (atomic unit with definition and semantic neighborhood). The hierarchical structure supports both computational access and traditional lexicographic organization.
  • Figure 3: Representative lexeme entries from OpenGloss showing core structure: multiple senses with definitions, semantic relationships (synonyms, hypernyms, hyponyms), usage examples, encyclopedic context, and etymology. Algorithm represents technical vocabulary; photosynthesis illustrates scientific terminology.
  • Figure 4: Distribution of number of senses per lexeme in OpenGloss. Most lexemes (61.5%) have 2-4 senses, with a long tail of highly polysemous lexemes reaching up to 24 senses. The median is 3 senses per lexeme.
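
The three-level hierarchy described in the Figure 2 caption (Lexeme, Part of Speech, Lexical Sense) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual schema: it uses standard-library dataclasses with hand-written checks in place of Pydantic so that it runs without extra dependencies, and every class name, field name, and the sample entry text are assumptions made for illustration. Only the overall shape (a root container with etymology and encyclopedia, POS-level entries with 1-4 senses and morphology, and senses carrying a definition plus semantic neighbors) comes from the captions above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexicalSense:
    """Atomic unit: one definition plus its semantic neighborhood (field names assumed)."""
    definition: str
    synonyms: List[str] = field(default_factory=list)
    hypernyms: List[str] = field(default_factory=list)
    hyponyms: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Stand-in for schema validation: reject empty definitions.
        if not self.definition.strip():
            raise ValueError("a sense needs a non-empty definition")


@dataclass
class PartOfSpeechEntry:
    """POS-specific level holding 1-4 senses and morphology (per the Figure 2 caption)."""
    pos: str
    senses: List[LexicalSense]
    morphology: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not 1 <= len(self.senses) <= 4:
            raise ValueError("expected 1-4 senses per part of speech")


@dataclass
class Lexeme:
    """Root container with etymology and encyclopedic content."""
    word: str
    etymology: str
    encyclopedia: str
    parts_of_speech: List[PartOfSpeechEntry]

    def sense_count(self) -> int:
        # Total senses across all parts of speech for this lexeme.
        return sum(len(p.senses) for p in self.parts_of_speech)


# Hypothetical entry; the text is made up for illustration, not taken from the dataset.
entry = Lexeme(
    word="algorithm",
    etymology="(illustrative placeholder)",
    encyclopedia="(illustrative placeholder)",
    parts_of_speech=[
        PartOfSpeechEntry(
            pos="noun",
            senses=[
                LexicalSense(
                    definition="A finite sequence of steps for solving a problem.",
                    hypernyms=["procedure"],
                )
            ],
        )
    ],
)
print(entry.word, entry.sense_count())
```

Validating at construction time mirrors the idea of checking each generated record against a schema before it enters the graph: a malformed record fails immediately rather than propagating downstream.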