Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE
Christian Møller Dahl, Torben Johansen, Christian Vedel
TL;DR
The paper tackles the heavy burden of manually coding historical occupational descriptions into HISCO (Historical International Standard Classification of Occupations) codes. It introduces OccCANINE, a fine-tuned CANINE transformer that learns semantic mappings from 14 million occupation–HISCO pairs across 13 languages, enabling fast and reproducible coding without text cleaning. The model achieves high performance (approximately 93.6% accuracy, 95.5% precision, 98.2% recall, and an F1 score of about 0.960) and is robust to out-of-distribution data. The paper also gives guidance on language-specific thresholds and notes the potential for quick fine-tuning on niche domains. This approach democratizes access to standardized occupational data, facilitating large-scale economic history analyses and offering a blueprint for applying similar methods to other historical classification schemes.
Abstract
This paper introduces a new tool, OccCANINE, to automatically transform occupational descriptions into the HISCO classification system. The manual work involved in processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. We fine-tune a pre-existing language model (CANINE) to do this automatically, thereby performing in seconds and minutes what previously took days and weeks. The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages contributed by 22 different sources. Our approach is shown to have accuracy, recall, and precision above 90 percent. Our tool breaks the metaphorical HISCO barrier and makes this data readily available for analysis of occupational structures with broad applicability in economics, economic history, and various related disciplines.
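To make the reported accuracy, precision, recall, and F1 figures concrete, the following is a minimal sketch of how such metrics could be computed when a model outputs one probability per candidate HISCO code and a threshold decides which codes are kept. It is not the authors' evaluation code: the function names, the 0.5 threshold, the per-observation averaging, and the example probabilities and code strings are all assumptions made for illustration.

```python
from statistics import mean


def apply_threshold(probs, threshold=0.5):
    """Keep every code whose predicted probability meets the threshold.

    `probs` maps a code string to a probability. The default threshold
    is an illustrative assumption, not a value taken from the paper.
    """
    return {code for code, p in probs.items() if p >= threshold}


def observation_metrics(predicted, true):
    """Score one description: exact-match accuracy plus set-based P/R/F1."""
    overlap = len(predicted & true)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(true) if true else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": float(predicted == true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def evaluate(prob_rows, true_rows, threshold=0.5):
    """Average the per-observation metrics over all descriptions."""
    rows = [
        observation_metrics(apply_threshold(probs, threshold), true)
        for probs, true in zip(prob_rows, true_rows)
    ]
    return {name: mean(row[name] for row in rows) for name in rows[0]}


# Hypothetical model outputs for three descriptions (made-up numbers).
example_probs = [
    {"61110": 0.90, "62105": 0.20},
    {"95120": 0.60, "95410": 0.55},
    {"41025": 0.30},
]
example_true = [{"61110"}, {"95120"}, {"41025"}]

if __name__ == "__main__":
    for name, value in evaluate(example_probs, example_true).items():
        print(f"{name}: {value:.3f}")
```

Raising the threshold in this sketch removes low-confidence codes, which tends to raise precision at the cost of recall. A language-specific threshold would amount to passing a different `threshold` value for each language's subset of the data.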
