The building blocks of software work explain coding careers and language popularity

Xiangnan Feng, Johannes Wachs, Simone Daniotti, Frank Neffke

TL;DR

This paper builds a fine-grained taxonomy of software development tasks from Stack Overflow, linking micro-level problem solving to macro labor-market outcomes. Using a bipartite stochastic block model, PMI-based task relatedness, and UMAP/HDBSCAN for visualization, the authors identify 237 canonical software tasks and map them to real-world job ads and salaries. They show that task value predicts advertised salaries and that individuals learn and diversify across tasks, with Python enabling entry into higher-value tasks and broader career flexibility. The study demonstrates the utility of large-scale task taxonomies for understanding labor-market dynamics, technology diffusion, and language-driven career trajectories, while acknowledging limitations and offering avenues for application in education and workforce development.

Abstract

Recent waves of technological transformation have fueled debates about the changing nature of work. Yet to understand the future of work, we need to know more about what people actually do in their jobs, going beyond educational credentials or job descriptions. Here we analyze work in the global software industry using tens of millions of Question and Answer posts on Stack Overflow to create a fine-grained taxonomy of software tasks, the elementary building blocks of software development work. These tasks predict salaries and job requirements in real-world job ads. We also observe how individuals learn within tasks and diversify into new tasks. Tasks that people acquire tend to be related to their old ones, but of lower value, suggesting that they are easier. An exception is users of Python, an increasingly popular programming language known for its versatility. Python users enter tasks that tend to be higher-value, providing an explanation for the language's growing popularity based on the tasks Python enables its users to perform. In general, these insights demonstrate the value of task taxonomies extracted at scale from large datasets: they offer high resolution and near real-time descriptions of changing labor markets. In the case of software tasks, they map such changes for jobs at the forefront of a digitizing global economy.

Paper Structure

This paper contains 34 sections, 14 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Mapping software tasks. a. Stylized depiction of the bipartite question-tag network. SBM groups tags into communities (tasks) that connect to similar sets of questions. ChatGPT-4.0 finds a common label that summarizes each community’s tag information. b. Task space. Pointwise mutual information ($\mathrm{PMI}$) expresses how surprisingly often two tasks are performed by the same users. UMAP embeds the resulting co-occurrence network in a 2-dimensional plane (the task space). c. Close-up on Develop AI models task, depicting the original network structure among the 10 most closely related tasks. d. Task values. Nodes are colored according to their task value, estimated from salary information in the SO 2023 developers survey. Darker shades indicate more valuable tasks. e. Table of the five most and least valuable tasks. For details, see Methods.
  • Figure 2: Job ads. a. Schematic representation of the workflow to extract salary and task requirements from online job ads by prompting ChatGPT. Task requirements are converted to the 237-dimensional SO task vectors of Fig. 1 based on cosine similarities between text embeddings of task requirements and SO tasks. b. Prediction of task requirements. 40% of task vector elements are masked and grouped into 10 equally sized bins based on the fit of the masked task to the (unmasked) task requirements of a job (see Methods). The plot displays the estimated probability that tasks in a bin are required in the job. c. Predictions of wage offers. Jobs are grouped into equally sized bins based on the average value of required tasks. The vertical axis shows the estimated mean of the advertised wage offers. Vertical bars in panels b and c represent 95% confidence intervals.
  • Figure 3: Task dynamics. a. Task user-share change from 2009 to 2022. Purple markers signal increases, orange markers decreases, in user shares between 2009 and 2022. Marker transparency reflects the size of shifts: darker tones indicate larger changes in user shares. b. Estimated probability of diversifying into new tasks at different values of density, i.e., users' relatedness-weighted experience in other tasks: $d_{\theta,u} = \sum_\kappa \frac{\mathrm{R}_{\theta,\kappa}}{\sum_\tau \mathrm{R}_{\theta,\tau}} X_{\kappa,u}$, where $X_{\kappa,u}$ denotes user $u$'s experience in task $\kappa$ and $\mathrm{R}_{\theta,\kappa}$ the relatedness between tasks $\theta$ and $\kappa$. c. Regression analysis of answer popularity on task experience. Popularity is measured either as the number of votes the answer receives, $\log (\# \text{votes}_{a} + 1)$, or as whether or not the answer is the top answer to the question. Task experience is measured as the number of prior answers the user provided to questions on task $\theta$ in the preceding two years, $\log(\# \text{answers}_{u(a),\theta,t(a)})$. Control variables include $\log (\# \text{answers}_{q(a)})$ and $\log \left(\sum_{\alpha \in A_q}\# \text{votes}_{\alpha} +1\right)$, i.e., the total number of answers provided to question $q(a)$ and the sum of all votes across these answers. To avoid problems due to $\log 0$ values, we add 1 to counts that can evaluate to 0. The plot shows point estimates with their 95% confidence intervals. Purple markers refer to regression analyses of whether or not an answer is the top answer, orange markers to analyses of the number of votes. d. Regression of whether or not a user will adopt a new task $\theta$ on the task value, $V_\theta$, controlling for the user's density around the task, $d_{\theta,u}$. Purple markers refer to Python-related questions only, orange markers to the full sample.
  • Figure 4: Programming languages. a. Task-language matrix. Elements are colored when at least 10 SO users have at least one answer post in the task-language combination. b. Number of tasks in which a programming language ranks as the top language in terms of SO users. The graph shows time series for the six largest languages in terms of cumulative SO posts between August 2008 and June 2023. c. Python's task footprint. Nodes are colored if Python ranks among the top 3 programming languages for the associated task. Fully colored: 1st rank, 50% transparency: 2nd rank, 75% transparency: 3rd rank.
  • Figure :
  • ...and 5 more figures
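
The density measure in the caption of Figure 3b can be illustrated with a small worked example. The sketch below is not the authors' code. It only evaluates the stated formula, $d_{\theta,u} = \sum_\kappa \frac{\mathrm{R}_{\theta,\kappa}}{\sum_\tau \mathrm{R}_{\theta,\tau}} X_{\kappa,u}$, for a single user. The function name `density`, the toy relatedness matrix `R`, and the experience vector `X_u` are all hypothetical, and the zero diagonal (a task does not count as related to itself) is an assumption made here, not something the caption specifies.

```python
def density(relatedness, experience):
    """Relatedness-weighted experience of one user around every task.

    relatedness: square list of lists, relatedness[theta][kappa] = R_{theta,kappa}
    experience:  list, experience[kappa] = X_{kappa,u} for a single user u
    Returns a list d with d[theta] = sum_kappa R[theta][kappa] * X[kappa] / sum_tau R[theta][tau].
    """
    result = []
    for row in relatedness:
        total = sum(row)  # normalizing constant: sum over tau of R_{theta,tau}
        if total == 0:
            # A task with no related tasks has no defined weights; report 0.0 (assumption).
            result.append(0.0)
            continue
        weighted = sum(r * x for r, x in zip(row, experience))
        result.append(weighted / total)
    return result


# Toy, made-up relatedness values for three hypothetical tasks (zero diagonal by assumption).
R = [
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 0.5],
    [1.0, 0.5, 0.0],
]
# Hypothetical user with experience in task 0 only.
X_u = [1.0, 0.0, 0.0]

d_u = density(R, X_u)
print(d_u)
```

In this toy setting the user's density is highest around task 1, the task most strongly related to the one they already perform, which is the quantity that panel b relates to the probability of diversifying into a new task.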