JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching

Antoine Magron; Anna Dai; Mike Zhang; Syrielle Montariol; Antoine Bosselut

TL;DR

JobSkape presents a framework for generating synthetic job postings aligned to the ESCO taxonomy to improve skill matching without costly annotations. It combines multi-skill sentence generation, prompt-tuned generation, and a self-refinement loop to produce SkillSkape, a high-quality synthetic dataset with offline metrics showing realism and coherence. The authors also propose an in-context learning pipeline for end-to-end skill extraction and mapping to ESCO, achieving competitive results on real-world benchmarks without retraining. The work highlights practical potential for scalable skill-gap analysis and evaluation while acknowledging limitations from closed models, language scope, bias, and taxonomy coverage.

Abstract

Recent approaches to skill matching that use synthetic training data for classification or similarity model training have shown promising results, reducing the need for time-consuming and expensive annotations. However, previous synthetic datasets have limitations, such as featuring only one skill per sentence and generally comprising short sentences. In this paper, we introduce JobSkape, a framework for generating synthetic data that addresses these limitations and is specifically designed to enhance skill-to-taxonomy matching. Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings tailored for skill-matching tasks. We introduce several offline metrics that show that our dataset resembles real-world data. Additionally, we present a multi-step pipeline for skill extraction and matching tasks using large language models (LLMs), benchmarking against known supervised methodologies. We show that downstream evaluation results on real-world data can beat baselines, underscoring the efficacy and adaptability of our approach.

Paper Structure (54 sections, 3 equations, 4 figures, 7 tables)

This paper contains 54 sections, 3 equations, 4 figures, 7 tables.

Introduction
Contributions.
Related Work
Synthetic Data Generation.
Synthetic Data for Job Postings.
Skill Matching.
The JobSkape Framework
The Label Space
Formal Approach
Prompt Tuning for Generation
Refinement of SkillSkape
Span Extraction.
Positive example.
Negative example.
Summary and Comparison
...and 39 more sections

Figures (4)

Figure 1: Three-step Skill Extraction and Matching Pipeline. We show our in-context learning pipeline for end-to-end skill matching. We use an LLM to extract skills from job ads, then select candidates using heuristics, and finally perform skill matching against a constrained taxonomy.
Figure 2: Ablation study for In-context Learning
Figure 3: Rule-based, embedding-based, and hybrid candidate selection methods for selecting $n$ candidates. Because the hybrid method takes the union of the rule-based and embedding-based candidates, setting $n=5$ with the hybrid method yields approximately $2 \times n$ selected candidates.
Figure 4: Sentence length distribution in the three datasets. SkillSkape has much longer sentences. Decorte has very short sentences and low length variance.
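
The hybrid candidate selection described in the Figure 3 caption can be illustrated with a minimal sketch. It takes the top $n$ candidates from a rule-based selector and from an embedding-based selector and returns their union, which is why the result can hold up to $2 \times n$ candidates. Everything else here is an assumption made for illustration and not the authors' implementation: the token-overlap rule, the character-trigram vectors standing in for a real embedding model, and the function names `rule_based`, `embedding_based`, and `hybrid`.

```python
import math
from collections import Counter


def rule_based(span, labels, n):
    """Hypothetical rule: rank taxonomy labels by tokens shared with the span."""
    span_tokens = set(span.lower().split())
    scored = []
    for label in labels:
        overlap = len(span_tokens & set(label.lower().split()))
        if overlap > 0:
            scored.append((overlap, label))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:n]]


def trigram_vector(text):
    """Toy stand-in for an embedding: character-trigram counts."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embedding_based(span, labels, n):
    """Rank taxonomy labels by cosine similarity to the span."""
    span_vec = trigram_vector(span)
    scored = [(cosine(span_vec, trigram_vector(label)), label) for label in labels]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:n]]


def hybrid(span, labels, n):
    """Union of both selectors, order-preserving, so at most 2 * n candidates."""
    merged = rule_based(span, labels, n) + embedding_based(span, labels, n)
    return list(dict.fromkeys(merged))  # drops duplicates, keeps first occurrence
```

When the two selectors agree on some labels, the union is smaller than $2 \times n$, which matches the caption's wording that the count is approximate.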
