Computational Job Market Analysis with Natural Language Processing
Mike Zhang
TL;DR
This work develops a computational framework for analyzing labor market demands through NLP by extracting and grounding skills from job postings in multilingual contexts. It introduces a data-centric pipeline (data annotation, de-identification with JobStack, and data maps for active learning) and advances skill extraction (SE) through weak supervision, domain-adaptive pretraining (ESCOXLM-R), and retrieval-augmented models (NNOSE). The thesis contributes two open corpora (JobStack and SkillSpan) and two Danish datasets (Kompetencer) to support cross-lingual SE, plus methodological innovations in active learning (Cartography Active Learning) and retrieval-augmented NLP for scalable, domain-specific job market analytics. Across English and Danish, the ESCO-informed pre-training and retrieval-augmented approaches yield substantial gains, particularly for short-span and long-tail skills, enabling more robust labor-market insights and better job-matching signals. Collectively, these methods advance transparent, multilingual NLP for labor-market analytics with practical implications for policymakers, platforms, and workers, while highlighting limitations and directions for extending taxonomy coverage and cross-lingual transfer. SE is formulated as a sequence-labeling task with BIO tags and evaluated with Span-F1 and entity-linking metrics, which supports rigorous, repeatable benchmarking of models across datasets and languages.
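To make the sequence-labeling formulation concrete, the following is a minimal sketch of how BIO tags can be decoded into spans and scored with an exact-match span-level F1. It is an illustration under stated assumptions, not the thesis's evaluation code: the function names `bio_to_spans` and `span_f1`, the single `SKILL` label, and the choice to ignore a stray `I-` tag without a preceding `B-` are all hypothetical.

```python
from typing import List, Set, Tuple

# A span is (start index, exclusive end index, label).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Decode one BIO-tagged sequence into a set of labeled spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Assumption: an I- tag that does not continue a span is ignored.
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching (start, end, label) spans."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction recovers one of the two gold spans.
gold = [["O", "B-SKILL", "I-SKILL", "O", "B-SKILL"]]
pred = [["O", "B-SKILL", "I-SKILL", "O", "O"]]
score = span_f1(gold, pred)
```

In this toy case precision is 1 and recall is 1/2, so the score is 2/3. Exact matching is a strict choice: a prediction that overlaps a gold span but misses one boundary token counts as both a false positive and a false negative.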
Abstract
[Abridged Abstract] Recent technological advances underscore labor market dynamics, yielding significant consequences for employment prospects and increasing job vacancy data across platforms and languages. Aggregating such data holds potential for valuable insights into labor market demands and the emergence of new skills, and for facilitating job matching for various stakeholders. However, despite prevalent insights in the private sector, transparent language technology systems and data for this domain are lacking. This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions, identifying challenges including scarcity of training data, lack of standardized annotation guidelines, and shortage of effective extraction methods from job ads. We frame the problem, obtain annotated data, and introduce extraction methodologies. Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training. We propose skill extraction using weak supervision, a taxonomy-aware pre-training methodology adapting multilingual language models to the job market domain, and a retrieval-augmented model leveraging multiple skill extraction datasets to enhance overall performance. Finally, we ground extracted information within a designated taxonomy.
