A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models

Palakorn Achananuparp, Ee-Peng Lim, Yao Lu

TL;DR

This work tackles automatic occupation classification against standardized taxonomies. It first assesses how well large language models memorize and use taxonomy knowledge, revealing gaps, especially in smaller models. It then introduces a multi-stage framework (inference, retrieval, and reranking) augmented by taxonomy-guided reasoning examples (TGRE) to align LLM outputs with the O*NET-SOC taxonomy and other taxonomies such as ESCO. Across occupation and skill classification tasks on large-scale real-world data (Jobs12K and ESCO-based post data), TGRE with sentence-based retrieval consistently outperforms baseline prompting approaches, achieving strong Precision@1 and RP@K while remaining cost-effective compared to frontier models. The results demonstrate a practical, scalable approach to taxonomy-aware classification that generalizes across domains and LLMs, with implications for labor market analysis and computational social science. The framework highlights the value of external taxonomic knowledge and structured reasoning over purely chain-of-thought prompts for domain-specific tasks.
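The two metrics named above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes RP@K denotes R-Precision@K in its commonly used form (hits in the top K divided by the smaller of K and the number of gold labels), and the function names `precision_at_1` and `rp_at_k` are hypothetical.

```python
from typing import List, Set


def precision_at_1(ranked: List[str], gold: Set[str]) -> float:
    """Return 1.0 if the top-ranked label is a gold label, else 0.0."""
    return 1.0 if ranked and ranked[0] in gold else 0.0


def rp_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """R-Precision@K (assumed definition): hits in the top k, divided by min(k, |gold|)."""
    if not gold or k <= 0:
        return 0.0
    hits = sum(1 for label in ranked[:k] if label in gold)
    return hits / min(k, len(gold))
```

Both functions score a single instance; a dataset-level figure would be the mean over all instances.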

Abstract

Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotation. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from a taxonomy, highlighting their limitations, especially for smaller models. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show that our framework not only enhances occupation and skill classification tasks, but also provides a cost-effective alternative to frontier models like GPT-4o, significantly reducing computational costs while maintaining strong performance. This makes it a practical and scalable solution for occupation classification and related tasks across LLMs.
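The three stages named in the abstract can be pictured with a small sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the LLM calls are replaced by caller-supplied functions (`generate`, `choose`), retrieval uses token-overlap (Jaccard) similarity as a stand-in for sentence-based retrieval, and the three-entry `TAXONOMY` is a toy sample.

```python
from typing import Callable, Dict, List

# Toy taxonomy (code -> title), illustrative only.
TAXONOMY: Dict[str, str] = {
    "15-1252.00": "Software Developers",
    "15-1211.00": "Computer Systems Analysts",
    "29-1141.00": "Registered Nurses",
}


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for sentence-embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def infer(job_text: str, generate: Callable[[str], str]) -> str:
    """Stage 1 (inference): a model proposes a free-text occupation title."""
    return generate(job_text)


def retrieve(query: str, taxonomy: Dict[str, str], k: int) -> List[str]:
    """Stage 2 (retrieval): ground the proposal in the k closest taxonomy entries."""
    ranked = sorted(taxonomy, key=lambda code: jaccard(query, taxonomy[code]), reverse=True)
    return ranked[:k]


def rerank(job_text: str, candidates: List[str],
           choose: Callable[[str, List[str]], str]) -> List[str]:
    """Stage 3 (reranking): a model picks the best candidate, which moves to the front."""
    best = choose(job_text, candidates)
    if best not in candidates:  # ignore a choice outside the retrieved set
        return candidates
    return [best] + [c for c in candidates if c != best]


def classify(job_text: str, generate: Callable[[str], str],
             choose: Callable[[str, List[str]], str], k: int = 2) -> List[str]:
    """Run inference, retrieval, and reranking in sequence."""
    guess = infer(job_text, generate)
    candidates = retrieve(guess, TAXONOMY, k)
    return rerank(job_text, candidates, choose)
```

With stub functions in place of a real model, `classify("We need a backend engineer", lambda t: "software developers", lambda t, c: c[0])` returns a ranked list of taxonomy codes. The point of the retrieval step in this sketch is that the final answer is always an entry that exists in the taxonomy, even when the first-stage output is only approximately right.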

Paper Structure

This paper contains 40 sections, 23 figures, 11 tables.

Figures (23)

  • Figure 1: The proposed framework
  • Figure 2: Prompt templates for knowledge recall and recognition tasks. Variables in the form ${var_name}, highlighted in blue, are replaced with corresponding values from the O*NET-SOC taxonomy.
  • Figure 3: TGRE-based prompt templates for occupation classification. Variables in the form ${var_name}, highlighted in blue, are replaced with corresponding values from task inputs.
  • Figure 4: CoT-based prompt templates for occupation classification. Variables in the form ${var_name}, highlighted in blue, are replaced with corresponding values from task inputs.
  • Figure 5: In-context examples for TGRE-based prompts for occupation classification. Segments highlighted in blue are detailed descriptions of the corresponding occupations retrieved from the taxonomy.
  • ...and 18 more figures