RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering

Weikang Qiu, Tinglin Huang, Ryan Rullo, Yucheng Kuang, Ali Maatouk, S. Raquel Ramos, Rex Ying

TL;DR

This paper introduces RephQA, a benchmark for evaluating the readability of large language models in public health question answering. It combines an expert-verified evaluation set of 533 QA pairs with a large-scale training question corpus (36,060 items) and a proxy multiple-choice accuracy task, so that readability and informativeness can be assessed together. The study benchmarks 25 LLMs and analyzes four readability-enhancement strategies; a novel token-adapted GRPO method delivers the strongest readability gains, at the cost of some QA accuracy. Across experiments, models struggle both to follow explicit readability targets and to infer appropriate reading levels, and readability-focused training improves clarity but risks omitting important information. The results highlight a trade-off between readability and factual accuracy that must be balanced to build practical, trustworthy public health agents.
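As a rough illustration of the group-relative idea behind GRPO (Group Relative Policy Optimization), the sketch below normalizes the rewards of several responses sampled for one question against their group mean and standard deviation. It covers only the standard advantage computation, not the paper's token-adapted variant, whose details are not given here. The function name, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value is an assumption.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to the same question,
# e.g. higher when the response is easier to read.
example_advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to provide a baseline.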

Abstract

Large Language Models (LLMs) hold promise in addressing complex medical problems. However, while most prior studies focus on improving accuracy and reasoning abilities, a significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses: specifically, their ability to answer public health questions clearly and simply for people without medical backgrounds. In this work, we introduce RephQA, a benchmark for evaluating the readability of LLMs in public health question answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across 13 topics, and includes a proxy multiple-choice task to assess informativeness, along with two readability metrics: Flesch-Kincaid grade level and professional score. Evaluation of 25 LLMs reveals that most fail to meet readability standards, highlighting a gap between reasoning and effective communication. To address this, we explore four readability-enhancing strategies: standard prompting, chain-of-thought prompting, Group Relative Policy Optimization (GRPO), and a token-adapted variant of GRPO. Token-adapted GRPO achieves the best results, a step toward more practical and user-friendly public health agents.
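For readers unfamiliar with the first metric, the Flesch-Kincaid grade level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The following is a minimal sketch of that formula, not the paper's evaluation code: the sentence splitter, the word pattern, and especially the vowel-group syllable counter are simplifying assumptions, and the function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups and
    drop one for a likely silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of `text` using naive tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


plain = "Wash your hands with soap. Stay home if you feel sick."
jargon = "Prophylactic antimicrobial administration mitigates nosocomial transmission."
plain_grade = flesch_kincaid_grade(plain)
jargon_grade = flesch_kincaid_grade(jargon)
```

Under this sketch the plain-language sentence pair scores far lower than the jargon-heavy one, which is the kind of gap the readability metric is meant to expose. Dedicated readability libraries use more careful syllable counting, so absolute values may differ.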

Paper Structure

This paper contains 41 sections, 5 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Two examples of user interactions with low- and high-readability patient education agents. Medical jargon is underscored in the responses generated by the low-readability agent.
  • Figure 2: Overview of the RephQA dataset construction and evaluation pipeline. The dataset is curated through expert review and multiple-choice question generation, while the evaluation involves readability assessment and accuracy measurement.
  • Figure 3: Readability control and understanding across four models (LLaMA 3.1–8B, Qwen-3-30B, DeepSeek-R1-Distill-Qwen, Qwen-3-30B-Thinking). (a) Achieved Flesch–Kincaid grades when instructed to write at target levels 6/9/12/15 (boxplots). Red horizontal lines mark the targets, and blue dashed lines mark the per-target mean grade. (b) Mean Flesch–Kincaid grade vs. target (error bars show 95% CI). (c) Accuracy of classifying generated responses into Flesch–Kincaid buckets (Elementary/Middle/High/College). (d) Accuracy when classifying Original/Medium/High counterparts of the same question.
  • Figure 4: Two examples of improved readability but reduced accuracy.