Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment

Chengfeng Dou, Ying Zhang, Zhi Jin, Wenpin Jiao, Haiyan Zhao, Yongqiang Zhao, Zhengwei Tao

Abstract

This research examines the use of Reinforcement Learning from AI Feedback (RLAIF) techniques to improve healthcare dialogue models, with the aim of tackling the challenges of preference-aligned data annotation while reducing the reliance on medical experts. We argue that the primary challenges in current RLAIF research for healthcare are the limitations of automated evaluation methods and the difficulties in accurately representing physician preferences. To address these challenges, we present a new evaluation framework based on standardized patient examinations. This framework is designed to objectively assess the effectiveness of large language models (LLMs) in guiding users and following instructions, enabling a comprehensive comparison across different models. Furthermore, our investigation of effective ways to express physician preferences using Constitutional AI algorithms highlighted the particular effectiveness of flowcharts. Building on this finding, we introduce an agent-based approach for annotating preference data. This approach autonomously creates medical dialogue flows tailored to the patient's condition, demonstrates strong generalization abilities, and reduces the need for expert involvement. Our results show that the agent-based approach outperforms existing RLAIF annotation methods in standardized patient examinations and surpasses current open-source medical dialogue LLMs in various test scenarios.

Paper Structure (37 sections, 4 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: (A): The research method in this work. (B): Two facets of dialogue competence evaluation. 'User Guidance' competence denotes the doctor's skill in prompting the patient to describe symptoms. Patients, often lacking medical knowledge, might fail to relay their information clearly and hence, need guidance from the doctor. 'Instruction Following' competence pertains to the doctor's skill in responding to the patient's questions accurately and amicably.
  • Figure 2: Prompt used to generate candidate responses.
  • Figure 3: Structure of the Patient Simulator. The dialogue content and patient details are initially divided into smaller segments, which are then stored in individual databases for each patient. The text-embedding-ada-002 model from OpenAI is used to encode these segments, producing vectors for similarity retrieval. During operation, the simulator retrieves the top four most relevant segments from the database based on the doctor's inquiries. Subsequently, GPT-4 uses the prompt depicted in Figure 4 to formulate the patient's response by merging the dialogue history with the retrieved information.
  • Figure 4: Prompt for the patient simulator. Information drawn from the patient database is placed at Documents, while the dialogue history and the doctor's inquiries are placed at History and Question, respectively. Question also serves as the query for retrieving patient information. So that Question carries enough context for accurate retrieval, the last two rounds of dialogue are used as Question.
  • Figure 5: Three methods for labeling preference alignment data, where the strategy on the right improves upon the strategy on the left.
  • ...and 6 more figures
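
The retrieval step described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for text-embedding-ada-002, the prompt template is hypothetical (the actual prompt is the one shown in Figure 4), the GPT-4 call is omitted, and treating each entry of `history` as one "round" is our own reading of the caption.

```python
import math
import re
from collections import Counter

TOP_K = 4            # the caption states that the top four segments are retrieved
QUESTION_ROUNDS = 2  # the caption states that the last two rounds form Question


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for text-embedding-ada-002)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(segments, question, k=TOP_K):
    """Return the k patient-record segments most similar to the question."""
    query = embed(question)
    ranked = sorted(segments, key=lambda s: cosine(embed(s), query), reverse=True)
    return ranked[:k]


def build_prompt(segments, history):
    """Assemble a hypothetical prompt with Documents, History, and Question slots.

    `history` is a list of utterance strings; each entry is treated as one
    round here, which is an assumption made for illustration.
    """
    question = "\n".join(history[-QUESTION_ROUNDS:])
    documents = retrieve(segments, question)
    return (
        "Documents:\n" + "\n".join(f"- {d}" for d in documents) + "\n\n"
        "History:\n" + "\n".join(history) + "\n\n"
        "Question:\n" + question + "\n"
    )


if __name__ == "__main__":
    patient_segments = [
        "Patient has had a dry cough for three days.",
        "No known drug allergies.",
        "Temperature measured at 38.2 C last night.",
        "Works as a teacher.",
        "Cough worsens at night.",
        "No history of smoking.",
    ]
    dialogue = [
        "Doctor: What brings you in today?",
        "Patient: I feel unwell.",
        "Doctor: Do you have a cough or a fever?",
    ]
    # The resulting string would be sent to the LLM that plays the patient.
    print(build_prompt(patient_segments, dialogue))
```

Using the last two rounds rather than only the final doctor inquiry as the query gives the retriever more context, which is the motivation stated in the Figure 4 caption. A real embedding model would also match paraphrases that this word-overlap stand-in misses.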