OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models

Mingfeng Xue; Dayiheng Liu; Kexin Yang; Guanting Dong; Wenqiang Lei; Zheng Yuan; Chang Zhou; Jingren Zhou

TL;DR

The paper addresses occupational bias in instruction-tuning data for large language models by introducing OccuQuest, a large-scale dataset spanning 1,013 occupations across 26 categories. The dataset is built by querying ChatGPT with hierarchical prompts to generate prompts, responses, and multi-turn dialogues. OccuQuest exhibits a more balanced occupational distribution than existing datasets, enabling models to better assist practitioners across diverse fields. OccuLLaMA, obtained by fine-tuning LLaMA on OccuQuest, outperforms state-of-the-art LLaMA derivatives on occupation-related questions, reaching a win rate of 86.4% against WizardLM on occu-quora under GPT-4 evaluation. The authors further present ProLLaMA, which combines OccuQuest with Tulu and achieves strong results on comprehensive benchmarks (MMLU, GSM8K, BBH, HumanEval), underscoring the value of integrating OccuQuest with existing data. The dataset and model parameters are released, and the paper discusses limitations and ethical considerations.

Abstract

The emergence of large language models (LLMs) has revolutionized natural language processing tasks. However, existing instruction-tuning datasets suffer from occupational bias: the majority of data relates to only a few occupations, which hampers the ability of instruction-tuned LLMs to generate helpful responses to professional queries from practitioners in specific fields. To mitigate this issue and promote occupation-inclusive LLMs, we create an instruction-tuning dataset named OccuQuest, which contains 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories. We systematically query ChatGPT, organizing the queries hierarchically by Occupation, Responsibility, Topic, and Question, to ensure comprehensive coverage of occupational specialty inquiries. Compared with three commonly used datasets (Dolly, ShareGPT, and WizardLM), OccuQuest exhibits a more balanced distribution across occupations. Furthermore, we assemble three test sets for comprehensive evaluation: an occu-test set covering 25 occupational categories, an estate set focusing on real estate, and an occu-quora set containing real-world questions from Quora. We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and WizardLM) on professional questions in GPT-4 and human evaluations. Notably, on the occu-quora set, OccuLLaMA reaches a high win rate of 86.4% against WizardLM.
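The hierarchical querying described in the abstract (Occupation, then Responsibility, then Topic, then Question) can be pictured as a nested expansion. The following is only a minimal sketch under stated assumptions: the prompt templates, the `expand` function, and the `fake_ask` stand-in for a ChatGPT request are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates; the prompts actually used by the authors
# are not given in this text.
TEMPLATES = {
    "responsibility": "List the main responsibilities of a {occupation}.",
    "topic": "List topics a {occupation} must know for: {responsibility}.",
    "question": "List questions a {occupation} might ask about: {topic}.",
}


def expand(occupation: str, ask: Callable[[str], List[str]]) -> List[Dict[str, str]]:
    """Walk Occupation -> Responsibility -> Topic -> Question and collect records."""
    records = []
    resp_prompt = TEMPLATES["responsibility"].format(occupation=occupation)
    for resp in ask(resp_prompt):
        topic_prompt = TEMPLATES["topic"].format(
            occupation=occupation, responsibility=resp
        )
        for topic in ask(topic_prompt):
            question_prompt = TEMPLATES["question"].format(
                occupation=occupation, topic=topic
            )
            for question in ask(question_prompt):
                records.append(
                    {
                        "occupation": occupation,
                        "responsibility": resp,
                        "topic": topic,
                        "question": question,
                    }
                )
    return records


def fake_ask(prompt: str) -> List[str]:
    """Offline stand-in for a model request: always returns two placeholders."""
    return [f"item-{i}" for i in range(2)]


records = expand("nurse", fake_ask)
```

With a real model call substituted for `fake_ask`, each level fans out from the one above it, which is one plausible way to reach questions for occupations that rarely appear in general-purpose instruction data.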

Paper Structure (27 sections, 28 figures, 7 tables)

Introduction
Related Works
Bias in Datasets
Instruction Tuning
OccuQuest Dataset
Dataset Construction
Dataset Split
Balanced Distribution of Occupations
Experiments
Baselines
Training Details
Evaluation Setup
GPT-4 Evaluation
Human Evaluation
Experimental Results
...and 12 more sections

Figures (28)

Figure 1: The distribution of occupational categories across various datasets.
Figure 2: An illustration of the OccuQuest dataset construction process, where the contents highlighted with a background color are ultimately gathered to constitute the dataset. To eliminate duplicate samples, MinHash filtering is applied after steps 2, 3, and 4.
Figure 3: GPT-4 evaluation results on OccuLLaMA against the comparative baselines.
Figure 4: The win rates of OccuLLaMA vs Vicuna under different occupational categories.
Figure 5: GPT-4 evaluation results on ProLLaMA against the comparative baselines.
...and 23 more figures
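The Figure 2 caption mentions MinHash filtering to eliminate duplicate samples. As a rough illustration of that general technique, here is a standard-library sketch; the shingle size, the number of hash functions, and the 0.8 similarity threshold are assumptions chosen for the example, not values reported by the authors, and the function names are hypothetical.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Split text into lowercase word n-grams (assumed shingle size: 3)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def signature(text: str, num_perm: int = 64) -> List[int]:
    """MinHash signature: for each seeded hash, keep the minimum over shingles."""
    items = shingles(text)
    sig = []
    for seed in range(num_perm):
        sig.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                    "big",
                )
                for s in items
            )
        )
    return sig


def similarity(a: List[int], b: List[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Greedily keep a text only if it is not too similar to any kept text."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = signature(text)
        if all(similarity(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept


samples = [
    "How do I price a rental property?",
    "how do i price a rental property?",
    "What licenses does an electrician need?",
]
unique = deduplicate(samples)
```

The pairwise comparison here is quadratic in the number of kept samples; at the scale of a full dataset one would typically pair MinHash with locality-sensitive hashing, which this sketch omits for brevity.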
