Deep Learning-based Computational Job Market Analysis: A Survey on Skill Extraction and Classification from Job Postings
Elena Senger, Mike Zhang, Rob van der Goot, Barbara Plank
TL;DR
This survey covers NLP-driven deep learning approaches to skill extraction and classification from job postings, clarifying terminology and cataloging publicly available datasets to support reproducibility. It systematically reviews three extraction paradigms (span labeling, binary classification, coarse-label extraction) and two classification paradigms (direct matching and extraction-assisted labeling), highlighting the shift toward domain-adapted transformers, extreme multi-label classification (XMLC), and LLM-based re-ranking. The work catalogs eight datasets (SAYFULLINA, GREEN, FIJO, SKILLSPAN, KOMPETENCER, DECORTE, GNEHM-ICT, BHOLA) and traces modeling progress from LSTMs to BERT-based and cross-lingual approaches, including ESCO-grounded mappings. The findings emphasize standardization on the ESCO and O*NET taxonomies and the increasing role of LLMs in data augmentation and label mapping, while pointing to future research on implicit skills, benchmarking, and cross-industry transfer. Overall, the survey provides a cohesive NLP-centric perspective on methodology, data resources, and open challenges in automated skill extraction and classification from job postings.
Abstract
Recent years have brought significant advances to Natural Language Processing (NLP), which have enabled fast progress in the field of computational job market analysis. Core tasks in this application domain are skill extraction and classification from job postings. Because of its quick growth and interdisciplinary nature, this emerging field has not yet received an exhaustive assessment. This survey aims to fill this gap by providing a comprehensive overview of deep learning methodologies, datasets, and terminologies specific to NLP-driven skill extraction and classification. Our catalog of publicly available datasets addresses the lack of consolidated information on dataset creation and characteristics. Finally, the focus on terminology addresses the current lack of consistent definitions for important concepts, such as hard and soft skills, and terms relating to skill extraction and classification.
