Computational Job Market Analysis with Natural Language Processing

Mike Zhang

TL;DR

This work develops a computational framework for analyzing labor market demands with NLP by extracting skills from job postings and grounding them in a taxonomy, in multilingual contexts. It introduces a data-centric pipeline (data annotation, de-identification with JobStack, and data maps for active learning) and advances skill extraction (SE) through weak supervision, domain-adaptive pretraining (ESCOXLM-R), and retrieval-augmented models (NNOSE). The thesis contributes two open corpora (JobStack and SkillSpan) and two Danish datasets (Kompetencer) to support cross-lingual SE, plus methodological innovations in active learning (Cartography Active Learning) and retrieval-augmented NLP for scalable, domain-specific job market analytics. Across English and Danish, the ESCO-informed pre-training and retrieval-augmented approaches yield substantial gains, particularly for short-span and long-tail skills, enabling more robust labor-market insights and better job-matching signals. Collectively, these methods advance transparent, multilingual NLP for labor-market analytics with practical implications for policymakers, platforms, and workers, while highlighting limitations and directions for extending taxonomy coverage and cross-lingual transfer. SE is formulated as a sequence-labeling task with BIO tags and evaluated with Span-F1 and entity-linking metrics, which supports rigorous, repeatable benchmarking of models across datasets and languages.
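To make the sequence-labeling formulation concrete, here is a minimal sketch of how BIO tags map to skill spans and how a span-level F1 can be computed by exact span matching. The function names (`bio_to_spans`, `span_f1`), the example sentence, and the lenient handling of a stray `I-` tag are illustrative assumptions, not the thesis's evaluation code; label types are ignored because only the `Skill` type is used here.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end) spans from a BIO tag sequence (hypothetical helper)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        # An open span closes when we reach an O tag or a new B- tag.
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                spans.add((start, i))
                start = None
        if tag.startswith("B-"):
            start = i
        elif tag.startswith("I-") and start is None:
            # Assumption: a stray I- tag without a preceding B- opens a span.
            start = i
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match span F1 between a gold and a predicted tag sequence."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence: "Experience with Python and project management"
gold = ["O", "O", "B-Skill", "O", "B-Skill", "I-Skill"]
pred = ["O", "O", "B-Skill", "O", "B-Skill", "O"]
score = span_f1(gold, pred)
```

In this example the prediction recovers "Python" exactly but truncates "project management", so one of two predicted spans matches one of two gold spans. Exact matching gives no partial credit for the truncated span, which is why span-level scores are stricter than token-level accuracy.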

Abstract

[Abridged Abstract] Recent technological advances underscore labor market dynamics, yielding significant consequences for employment prospects and increasing job vacancy data across platforms and languages. Aggregating such data holds potential for valuable insights into labor market demands, new skills emergence, and facilitating job matching for various stakeholders. However, despite prevalent insights in the private sector, transparent language technology systems and data for this domain are lacking. This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions, identifying challenges including scarcity of training data, lack of standardized annotation guidelines, and shortage of effective extraction methods from job ads. We frame the problem, obtain annotated data, and introduce extraction methodologies. Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training. We propose skill extraction using weak supervision, a taxonomy-aware pre-training methodology adapting multilingual language models to the job market domain, and a retrieval-augmented model leveraging multiple skill extraction datasets to enhance overall performance. Finally, we ground extracted information within a designated taxonomy.

Paper Structure

This paper contains 284 sections, 27 equations, 44 figures, 61 tables, 4 algorithms.

Figures (44)

  • Figure 1: A Graphical Illustration of ESCO. We show the simplified structure of the hierarchical ESCO taxonomy. It contains four levels of skills, where each skill can be connected to a specific occupation.
  • Figure 2: Sequence Labeling with Neural Networks. The flow for deep-learning-based sequence labeling. The input sequence passes through 1) a layer that tokenizes the input, 2) an encoder that transforms the token representations into meaningful vectors, and 3) an output layer where a tag (B-Skill, I-Skill, O) is predicted from each token vector.
  • Figure 3: Pool-based Active Learning Cycle. The human annotator first annotates a small labeled set $\mathcal{L}$, which is used to train an initial model. The model is then applied to an unlabeled pool of data $\mathcal{U}$. Given a certain acquisition function or score, the instances that are the most informative are selected. This set of instances is given to the oracle to annotate.
  • Figure 4: Snippet JobStack. Snippet of a job posting; the full job posting can be found in the appendix (app:A1).
  • Figure 5: Full Data Maps for AGNews & TREC. AGNews (120,000 instances) on the left, and TREC (5,452 instances) on the right, both with respect to an MLP trained for ten epochs. The x-axis shows variability and the y-axis shows confidence. The colors and shapes indicate correctness.
  • ...and 39 more figures
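
The three quantities plotted in the data maps of Figure 5 (confidence, variability, correctness) can be sketched as per-instance statistics over training epochs. The sketch below assumes the usual data-map definitions: confidence is the mean probability assigned to the gold label across epochs, variability is its standard deviation, and correctness is the fraction of epochs in which the gold label was predicted. The function name `data_map_stats`, the input layout, and the use of the population standard deviation are illustrative assumptions, not the thesis's implementation.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def data_map_stats(
    gold_probs: List[List[float]],
    gold_predicted: List[List[bool]],
) -> List[Tuple[float, float, float]]:
    """Return (confidence, variability, correctness) for each instance.

    gold_probs[e][i]     : probability the model gave the gold label of
                           instance i after epoch e (hypothetical layout).
    gold_predicted[e][i] : whether the gold label was the model's
                           prediction for instance i after epoch e.
    """
    n_instances = len(gold_probs[0])
    stats = []
    for i in range(n_instances):
        probs = [epoch[i] for epoch in gold_probs]
        hits = [epoch[i] for epoch in gold_predicted]
        confidence = fmean(probs)            # y-axis of the data map
        variability = pstdev(probs)          # x-axis of the data map
        correctness = sum(hits) / len(hits)  # encoded as color and shape
        stats.append((confidence, variability, correctness))
    return stats


# Two epochs, two instances: the first is ambiguous, the second is easy.
example_probs = [[0.2, 0.9], [0.8, 0.9]]
example_hits = [[False, True], [True, True]]
example_stats = data_map_stats(example_probs, example_hits)
```

Instances with high variability are the ones the model changes its mind about during training, while instances with high confidence and low variability are learned early and stay learned. An active learning strategy built on data maps can use these regions to decide which unlabeled instances are worth sending to the annotator.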