
HR-MultiWOZ: A Task Oriented Dialogue (TOD) Dataset for HR LLM Agent

Weijie Xu, Zicheng Huang, Wenxiang Hu, Xi Fang, Rajesh Kumar Cherukuri, Naumaan Nayyar, Lorenzo Malandri, Srinivasan H. Sengamedu

TL;DR

HR-MultiWOZ introduces the first open-source, HR-specific task-oriented dialogue dataset comprising 550 conversations across 10 HR domains to evaluate and train HR LLM agents. The authors present a transfer-friendly data-generation pipeline that leverages LLMs (Claude) for scenario creation and paraphrasing, paired with DeBERTa-based extractive labeling and MTurk verification to ensure high-quality, long-entity dialogue states. The dataset emphasizes extractiveness, domain relevance, and empathetic interactions, and demonstrates stronger dialogue richness and linguistic diversity than existing TOD resources. This work enables cost-efficient development of empathetic HR assistants and establishes a benchmark for HR-specific dialogue systems with ethical considerations and clear limitations. The authors also propose future enhancements, including multilingual expansion and API integration.

Abstract

Recent advancements in Large Language Models (LLMs) have been reshaping Natural Language Processing (NLP) tasks in several domains. Their use in the field of Human Resources (HR) still has room for expansion and could benefit several time-consuming tasks. Time-off submissions, medical claims filing, and access requests are noteworthy examples, but they are by no means the only ones. However, these developments must grapple with the pivotal challenge of constructing a high-quality training dataset. On one hand, most conversation datasets address problems for customers, not employees. On the other hand, gathering conversations with HR could raise privacy concerns. To address this, we introduce HR-MultiWOZ, a fully labeled dataset of 550 conversations spanning 10 HR domains for evaluating LLM agents. Our work makes the following contributions: (1) It is the first labeled open-source conversation dataset in the HR domain for NLP research. (2) It provides a detailed recipe for the data generation procedure along with data analysis and human evaluations. The data generation pipeline is transferable and can be easily adapted for labeled conversation data generation in other domains. (3) The proposed data-collection pipeline is mostly based on LLMs with minimal human involvement for annotation, which makes it time- and cost-efficient.

Paper Structure (16 sections, 9 figures, 7 tables)

This paper contains 16 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The data generation pipeline. HR experts start by identifying tasks, creating schemas, and generating employee profiles. An LLM is then applied to generate diverse scenarios and to paraphrase the conversations so that they read more naturally. Labels are then extracted by DeBERTa and refined through MTurk.
  • Figure 2: The conversation generation process. We first identify the task, schema, and employee profile. We then use an LLM to fill in the values in the schema, and use an LLM again to rephrase the conversation so that it reads more naturally. The parts where the HR assistant shows empathy are highlighted in red.
  • Figure 3: MTurk questions and selected examples used to assess whether the extracted answer is equivalent to the ground truth.
  • Figure 4: MTurk score distribution for whether the HR question is clear.
  • Figure 5: MTurk score distribution for whether the HR question is polite.
  • ...and 4 more figures
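
The captions for Figures 1 and 2 describe a schema-driven pipeline whose labels are extractive, meaning that each labeled value should appear as a span in the employee's answer. The Python sketch below illustrates only that idea. It is not the authors' implementation: the `Slot` and `Turn` classes, the `find_span` and `is_extractive` helpers, and the example slot names and utterances are all hypothetical, and the label is assumed to come from some upstream extractor (DeBERTa in the paper) rather than being computed here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Slot:
    """Hypothetical schema entry: a slot name and the question asked for it."""
    name: str
    question: str


@dataclass
class Turn:
    """One question/answer exchange and the label assigned to the answer."""
    slot: Slot
    answer: str
    label: Optional[str] = None  # assumed to be produced by an upstream extractor


def find_span(label: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of label in answer, or None.

    The match is exact and case-sensitive, which is the strictest reading
    of "extractive".
    """
    start = answer.find(label)
    if start == -1:
        return None
    return start, start + len(label)


def is_extractive(turn: Turn) -> bool:
    """True if the turn has a non-empty label that occurs verbatim in the answer."""
    return bool(turn.label) and find_span(turn.label, turn.answer) is not None


def non_extractive_slots(turns: List[Turn]) -> List[str]:
    """Names of slots whose labels would need human review in this sketch."""
    return [t.slot.name for t in turns if not is_extractive(t)]


# Illustrative, made-up time-off conversation.
leave_type = Slot("leave_type", "What type of leave are you requesting?")
start_date = Slot("start_date", "When would you like your leave to start?")

turns = [
    Turn(leave_type, "I need parental leave for my newborn.", "parental leave"),
    Turn(start_date, "I would like to start next Monday if possible.", "2024-01-01"),
]

print(non_extractive_slots(turns))
```

In this sketch the second label is a normalized date that never appears in the answer, so the turn is flagged for review. This is the kind of case where a human verification step, such as the MTurk stage mentioned in Figure 1, would be useful.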