Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data
Stephen Meisenbacher, Svetlozar Nestorov, Peter Norlander
TL;DR
This paper presents JAAT, an open-source toolkit that extracts structured, O*NET-aligned features from the NLx job-ad corpus to build a public-use, monthly aggregated labor-market dataset spanning 2015–2025. By mapping ad language to O*NET’s content model and cross-walking to ESCO where beneficial, the authors produce billions of data points (e.g., tasks, skills, tools, wages) and demonstrate convergent validity through cross-tool comparisons and external benchmarks. The work combines iterative, human-in-the-loop model development across multiple modules (SkillMatch, TaskMatch, TitleMatch, FirmExtract, WageExtract, JobTag) with rigorous validation, delivering large-scale, high-precision labor-market signals while explicitly acknowledging the limitations of ad data and taxonomy-based extraction. The resulting dataset supports national-trend analyses and fine-grained, occupation-level insights, with applications in education, workforce development, and policy research; the authors nonetheless emphasize careful interpretation and the need for benchmarks and reproducibility aids in future work.
Abstract
Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the potential for research and future uses in education and workforce development.
