How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes

Sheri Osborn, Rohit Valecha, H. Raghav Rao, Dan Sass, Anthony Rios

TL;DR

The paper offers a reproducible benchmark for evaluating how large language models forecast labor-market changes due to AI. It couples high-frequency job postings with global AI-adoption projections and uses time-aware evaluation splits that minimize the risk of information leakage. It systematically compares prompting strategies (task-structured, persona-driven, and hybrid) across model families, using a formal framework that conditions forecasts on historical data, prompts, and exogenous events. Key findings show that structured task prompts yield more stable long-horizon forecasts, while domain-grounded personas improve short- to mid-term predictions. Performance is uneven across sectors and horizons, underscoring the importance of domain-aware prompting and robust evaluation. The work contributes a testbed for studying AI-assisted labor forecasting, prompt design, and economic reasoning with LLMs, offering practical guidance for researchers and policymakers designing future labor-market analyses.

Abstract

Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand, especially in occupations affected by AI. Existing research has shown that LLMs can extract sentiment, summarize economic reports, and emulate forecaster behavior, but little work has assessed their use for forward-looking labor prediction. Our benchmark combines two complementary datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. We format these data into forecasting tasks with clear temporal splits, minimizing the risk of information leakage. We then evaluate LLMs using multiple prompting strategies, comparing task-scaffolded, persona-driven, and hybrid approaches across model families. We assess both quantitative accuracy and qualitative consistency over time. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends. However, performance varies significantly across sectors and horizons, highlighting the need for domain-aware prompting and rigorous evaluation protocols. By releasing our benchmark, we aim to support future research on labor forecasting, prompt design, and LLM-based economic reasoning. This work contributes to a growing body of research on how LLMs interact with real-world economic data, and provides a reproducible testbed for studying the limits and opportunities of AI as a forecasting tool in the context of labor markets.
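The temporal split and the accuracy measure mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the monthly index values, the cutoff date, and the last-value baseline that stands in for a model forecast are all assumptions made for illustration.

```python
from datetime import date


def temporal_split(series, cutoff, horizon):
    """Split (date, value) pairs at `cutoff`.

    History holds only observations strictly before the cutoff, so nothing
    from the forecast window can leak into the model input. Targets are the
    first `horizon` observations on or after the cutoff.
    """
    ordered = sorted(series, key=lambda pair: pair[0])
    history = [(d, v) for d, v in ordered if d < cutoff]
    future = [(d, v) for d, v in ordered if d >= cutoff]
    return history, future[:horizon]


def mean_squared_error(actual, predicted):
    """Mean squared error between two equal-length, non-empty sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly job-postings index (illustrative values only).
series = [(date(2023, month, 1), 100.0 + month) for month in range(1, 13)]

history, targets = temporal_split(series, cutoff=date(2023, 10, 1), horizon=3)

# Naive baseline standing in for an LLM forecast: repeat the last observed value.
last_value = history[-1][1]
forecast = [last_value] * len(targets)

mse = mean_squared_error([v for _, v in targets], forecast)
print(f"history points: {len(history)}, targets: {len(targets)}, MSE: {mse:.3f}")
```

Replacing the baseline with model outputs produced from `history` alone would give a comparable score, provided the prompt contains no observation dated on or after the cutoff.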

Paper Structure

This paper contains 8 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Overview of the LLM-based economic forecasting framework. The system integrates two datasets: job postings by sector and AI-related jobs by sector. Each experiment combines personas, time series inputs, and structured prompts to generate forecasts. Three forecasting strategies are used: direct forecasting, relative forecasting, and an event reasoning approach.
  • Figure 2: Count of outputs with MSE above 5,000 for each model, showing that GPT-4o-mini produces fewer extreme values than the LLaMA models. Results are on the Indeed Long-Horizon data.