Standard Occupation Classifier -- A Natural Language Processing Approach
Sidharth Rony, Jack Patman
TL;DR
The paper tackles automated SOC-code assignment for noisy online job advertisements using a BERT-based ensemble that fuses job titles, descriptions, and extracted skills. It investigates truncation strategies, preprocessing, and hierarchical versus flat classification, and reports that the combined title-description-skills ensemble performs best, with accuracy of up to 61% at the most granular (fourth) tier and 72% at the third tier. The work shows that transformer-based approaches are viable for SOC coding under two national taxonomies (UK ONS SOC and US O*NET SOC) using real-world data, and it highlights practical challenges such as long text, data sparsity at deep taxonomy levels, and taxonomy revisions. Together, the findings offer a scalable, data-driven method for real-time labour market analysis and occupation-specific demand insights from job postings, with implications for workforce planning and policy evaluation.
Abstract
Standard Occupational Classifiers (SOC) are systems used to categorize jobs and occupations based on their similarities in job duties, skills, and qualifications. Integrating these facets with Big Data from job advertisements offers the prospect of investigating labour demand that is specific to various occupations. This project investigates the use of recent developments in natural language processing to construct a classifier capable of assigning an occupation code to a given job advertisement. We develop various classifiers for both UK ONS SOC and US O*NET SOC, using different language models. We find that an ensemble model, which combines Google BERT and a neural network classifier while considering job title, description, and skills, achieves the highest prediction accuracy. Specifically, the ensemble model reaches a classification accuracy of up to 61% for the lower (or fourth) tier of SOC, and 72% for the third tier of SOC. This model could provide up-to-date, accurate information on the evolution of the labour market using job advertisements.
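To make the ensemble and the per-tier accuracy figures concrete, the following is a minimal sketch and not the authors' implementation. It assumes each input field (title, description, skills) has already been scored by its own classifier, that the scores are fused by plain averaging (the abstract does not state the fusion rule), and that codes are 4-digit UK-style strings whose leading digits identify the coarser tiers. The function names `fuse`, `predict`, and `tier_accuracy`, and all codes and probabilities, are hypothetical.

```python
from typing import Dict, List

# Maps a SOC code (as a string) to the probability one classifier assigns it.
Probs = Dict[str, float]


def fuse(field_probs: List[Probs]) -> Probs:
    """Average class probabilities across fields (simple soft voting).

    Assumption: equal weight per field; a code missing from a field's
    output counts as probability 0.0 for that field.
    """
    codes = set().union(*field_probs)
    n = len(field_probs)
    return {c: sum(p.get(c, 0.0) for p in field_probs) / n for c in codes}


def predict(field_probs: List[Probs]) -> str:
    """Return the code with the highest fused probability."""
    fused = fuse(field_probs)
    # Iterate codes in sorted order so ties resolve deterministically.
    return max(sorted(fused), key=fused.get)


def tier_accuracy(y_true: List[str], y_pred: List[str], digits: int) -> float:
    """Accuracy after truncating every code to its first `digits` characters.

    Assumption: the code prefix identifies the coarser tier, as in 4-digit
    UK-style codes; other code formats would need their own mapping.
    """
    hits = sum(t[:digits] == p[:digits] for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


if __name__ == "__main__":
    # Illustrative per-field outputs for one advertisement (made-up values).
    title = {"2134": 0.6, "2135": 0.4}
    description = {"2134": 0.3, "2135": 0.7}
    skills = {"2134": 0.5, "2139": 0.5}
    print(predict([title, description, skills]))

    # Illustrative gold and predicted codes for three advertisements.
    gold = ["2134", "2135", "3111"]
    pred = ["2134", "2139", "3545"]
    for digits in (4, 3, 1):
        print(digits, tier_accuracy(gold, pred, digits))
```

Truncating codes in this way is one reason accuracy can only stay the same or rise when moving from the fourth tier to the third: a prediction that misses the exact unit group may still fall in the correct parent group.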
