JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching

Antoine Magron, Anna Dai, Mike Zhang, Syrielle Montariol, Antoine Bosselut

TL;DR

JobSkape presents a framework for generating synthetic job postings aligned to the ESCO taxonomy to improve skill matching without costly annotations. It combines multi-skill sentence generation, prompt-tuned generation, and a self-refinement loop to produce SkillSkape, a high-quality synthetic dataset with offline metrics showing realism and coherence. The authors also propose an in-context learning pipeline for end-to-end skill extraction and mapping to ESCO, achieving competitive results on real-world benchmarks without retraining. The work highlights practical potential for scalable skill-gap analysis and evaluation while acknowledging limitations from closed models, language scope, bias, and taxonomy coverage.

Abstract

Recent approaches in skill matching, employing synthetic training data for classification or similarity model training, have shown promising results, reducing the need for time-consuming and expensive annotations. However, previous synthetic datasets have limitations, such as featuring only one skill per sentence and generally comprising short sentences. In this paper, we introduce JobSkape, a framework to generate synthetic data that tackles these limitations, specifically designed to enhance skill-to-taxonomy matching. Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings tailored for skill-matching tasks. We introduce several offline metrics that show that our dataset resembles real-world data. Additionally, we present a multi-step pipeline for skill extraction and matching tasks using large language models (LLMs), benchmarking against known supervised methodologies. We show that downstream evaluation results on real-world data can beat baselines, underscoring the efficacy and adaptability of our approach.

Paper Structure (54 sections, 3 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Three-step Skill Extraction and Matching Pipeline. We show our in-context learning pipeline for end-to-end skill matching. We use an LLM to extract skills from job ads, then select candidates using heuristics, and finally perform skill matching with a constrained taxonomy.
  • Figure 2: Ablation study for In-context Learning
  • Figure 3: Rule-based, embedding-based, and hybrid candidate selection methods to select $n$ candidates. Note that because the hybrid method takes the union of the rule-based and embedding-based results, setting $n=5$ with the hybrid method selects approximately $n \times 2$ candidates in total.
  • Figure 4: Sentence length distribution in the three datasets. SkillSkape has much longer sentences. Decorte has very short sentences and low length variance.
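
The hybrid candidate selection described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the taxonomy labels are a made-up toy list rather than real ESCO entries, the rule-based step is assumed to be simple token overlap, and character-trigram count vectors stand in for a learned embedding model. Only the union step, and the resulting candidate count of up to $n \times 2$, comes from the caption.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy labels for illustration only (not real ESCO entries).
TAXONOMY = [
    "manage budgets",
    "use spreadsheet software",
    "develop software prototypes",
    "communicate with customers",
    "manage project schedules",
    "analyse financial data",
]


def rule_based(query, labels, n):
    """Assumed heuristic: rank labels by the number of tokens shared with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(label.lower().split())), label) for label in labels]
    scored = [(score, label) for score, label in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:n]]


def char_trigrams(text):
    """Stand-in 'embedding': a sparse vector of character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embedding_based(query, labels, n):
    """Rank labels by cosine similarity to the query in the stand-in vector space."""
    qv = char_trigrams(query)
    ranked = sorted(labels, key=lambda label: (-cosine(qv, char_trigrams(label)), label))
    return ranked[:n]


def hybrid(query, labels, n):
    """Union of both candidate lists, so the result holds between n and 2n labels."""
    merged = rule_based(query, labels, n) + embedding_based(query, labels, n)
    return list(dict.fromkeys(merged))  # order-preserving de-duplication


extracted_skill = "manage budgets with spreadsheet software"
candidates = hybrid(extracted_skill, TAXONOMY, n=2)
print(len(candidates), candidates)
```

Because the two methods often agree on some labels, the union is usually smaller than $n \times 2$, which is why the caption describes the count as approximate.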