OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Michael J. Bommarito

TL;DR

OpenGloss is a synthetic encyclopedic dictionary and semantic knowledge graph generated by a four-stage, schema-validated pipeline. It attains WordNet-scale breadth (537K senses across 150K lexemes) while adding encyclopedic context, etymology, usage examples, collocations, and 9.14M semantic edges. The method uses a modular, multi-agent LLM framework with automated quality assurance, and the full resource was produced in 96 hours for under $1,000. OpenGloss is positioned as complementary to WordNet, BabelNet, and ConceptNet, offering pragmatic, learner-focused lexicography with rich educational content and the potential for frequent updates. The work highlights the trade-offs of synthetic generation, emphasizes reproducibility, and envisions applications in education, NLP, and knowledge-grounded language models, while noting limitations and the legal and ethical considerations around generated content.

Abstract

We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedic context, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
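To make "schema-validated LLM outputs" concrete, the following is a minimal sketch of how one generated sense record might be checked before it is accepted into a resource of this kind. It is an illustration under stated assumptions, not the OpenGloss pipeline: the `SenseEntry` shape, its field names, the `ALLOWED_POS` set, and the `parse_sense` helper are all hypothetical and are not taken from the published schema.

```python
import json
from dataclasses import dataclass, field

# Hypothetical record shape. The field names are illustrative assumptions,
# not the published OpenGloss schema.
@dataclass
class SenseEntry:
    lemma: str
    part_of_speech: str
    definition: str
    examples: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)

# Assumed closed set of part-of-speech labels for this sketch.
ALLOWED_POS = {"noun", "verb", "adjective", "adverb"}


def parse_sense(raw: str) -> SenseEntry:
    """Parse one JSON string into a SenseEntry, raising ValueError if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    # Required fields must be non-empty strings.
    for key in ("lemma", "part_of_speech", "definition"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty string field: {key}")
    if data["part_of_speech"] not in ALLOWED_POS:
        raise ValueError(f"unknown part of speech: {data['part_of_speech']}")

    # Optional fields default to empty lists but must be lists of strings.
    for key in ("examples", "synonyms"):
        items = data.get(key, [])
        if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
            raise ValueError(f"{key} must be a list of strings")

    return SenseEntry(
        lemma=data["lemma"],
        part_of_speech=data["part_of_speech"],
        definition=data["definition"],
        examples=data.get("examples", []),
        synonyms=data.get("synonyms", []),
    )
```

In a generation loop, a record that raises `ValueError` could be retried or discarded, so that only well-formed entries reach later stages. Whether OpenGloss retries, repairs, or drops invalid outputs is not stated in this section.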
