OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models
Mingfeng Xue, Dayiheng Liu, Kexin Yang, Guanting Dong, Wenqiang Lei, Zheng Yuan, Chang Zhou, Jingren Zhou
TL;DR
The paper addresses occupational bias in instruction-tuning data for large language models by introducing OccuQuest, a large-scale dataset spanning 1,013 occupations across 26 categories, built by querying ChatGPT with hierarchically organized prompts to generate prompts, responses, and multi-turn dialogues. OccuQuest exhibits a more balanced occupational distribution than existing datasets, enabling models to better assist practitioners across diverse fields. OccuLLaMA, obtained by fine-tuning LLaMA on OccuQuest, outperforms state-of-the-art LLaMA derivatives on occupation-related questions in GPT-4 and human evaluations, reaching a win rate of $86.4\%$ against WizardLM on the occu-quora set. The authors further present ProLLaMA, trained on OccuQuest combined with the Tulu data, which achieves strong results on comprehensive benchmarks (MMLU, GSM8K, BBH, HumanEval), underscoring the value of integrating OccuQuest with existing data. By releasing both the dataset and the model parameters, this work advances inclusive LLMs and provides a framework for reducing occupational bias, while acknowledging limitations and ethical considerations.
Abstract
The emergence of large language models (LLMs) has revolutionized natural language processing tasks. However, existing instruction-tuning datasets suffer from occupational bias: the majority of the data relates to only a few occupations, which hampers the ability of instruction-tuned LLMs to generate helpful responses to professional queries from practitioners in specific fields. To mitigate this issue and promote occupation-inclusive LLMs, we create an instruction-tuning dataset named \emph{OccuQuest}, which contains 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories. We systematically query ChatGPT, organizing the queries hierarchically by Occupation, Responsibility, Topic, and Question, to ensure comprehensive coverage of occupation-specific inquiries. Comparing OccuQuest with three commonly used datasets (Dolly, ShareGPT, and WizardLM), we observe that it exhibits a more balanced distribution across occupations. Furthermore, we assemble three test sets for comprehensive evaluation: an occu-test set covering 25 occupational categories, an estate set focusing on real estate, and an occu-quora set containing real-world questions from Quora. We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and WizardLM) on professional questions in GPT-4 and human evaluations. Notably, on the occu-quora set, OccuLLaMA reaches a high win rate of 86.4\% against WizardLM.
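
The hierarchical querying described above (Occupation, Responsibility, Topic, Question) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the prompt templates, the `expand` function, and the `stub_ask` stand-in for a ChatGPT call are all hypothetical and serve only to show how each level's output can seed the next level's prompt.

```python
from typing import Callable, Dict, List

# Level order taken from the abstract's hierarchy.
LEVELS = ["occupation", "responsibility", "topic", "question"]

# Hypothetical prompt templates; the paper's actual wording is not given here.
TEMPLATES = {
    "responsibility": "List the main responsibilities of a {occupation}.",
    "topic": "List topics a {occupation} must know for: {responsibility}.",
    "question": (
        "List questions a {occupation} might ask about {topic} "
        "when handling: {responsibility}."
    ),
}


def expand(context: Dict[str, str],
           ask: Callable[[str], List[str]]) -> List[Dict[str, str]]:
    """Depth-first expansion of one occupation into full query records."""
    depth = len(context)  # number of hierarchy levels already filled in
    if depth == len(LEVELS):
        return [dict(context)]
    level = LEVELS[depth]
    prompt = TEMPLATES[level].format(**context)
    records: List[Dict[str, str]] = []
    for item in ask(prompt):
        # Each answer at this level becomes context for the next level.
        records.extend(expand({**context, level: item}, ask))
    return records


def stub_ask(prompt: str) -> List[str]:
    """Stand-in for an LLM call (assumption): two placeholder items per prompt."""
    return [f"placeholder {i} ({len(prompt)} chars in prompt)" for i in (1, 2)]


records = expand({"occupation": "Real estate agent"}, stub_ask)
```

With the stub returning two items per prompt, one occupation expands into eight records, each carrying all four hierarchy levels. In a real setting `stub_ask` would be replaced by a function that sends the prompt to a chat model and parses the returned list.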
