LLMs in social services: How does chatbot accuracy affect human accuracy?

Jennah Gosciak, Eric Giannella, Zhaowen Guo, Michael Chen, Allison Koenecke

Abstract

Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations may be less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance significantly improves as chatbot quality improves: high-quality chatbots (96-100% accurate) improved caseworker accuracy by 27 percentage points. At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best (without chatbot suggestions). Finally, improvements in caseworker accuracy level off as chatbot accuracy increases, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.

Paper Structure (16 sections, 32 figures, 20 tables)


Figures (32)

  • Figure 1: (A) A typical client-caseworker interaction, in which a caseworker can use several different tools (including Google, policy manuals, ChatGPT, and a custom chatbot) to answer questions related to SNAP eligibility and reporting. We are interested in the effect of custom chatbot suggestions (orange). (B) Experimental design involving a control group that sees only the question text shown in (C) with no chatbot suggestions, and a treatment group assigned to see the questions in (C) with one of the simulated chatbot suggestions shown in (D). Overall chatbot performance levels were artificially varied across treatment arms. There were 10 different levels of aggregate simulated chatbot accuracy ranging from 53% accurate (21 incorrect suggestions) to 100% accurate (0 incorrect suggestions). (C) Example of an assessment question (drawn from questions in the benchmark dataset). Questions consist of a client background panel (Top) with five different client attributes that can have up to two distinct values and a multiple-choice question (Bottom) with four or five multiple-choice options. (D) Examples of correct and incorrect chatbot suggestions. We can toggle whether a suggestion is correct or incorrect for each question, allowing us to vary the overall chatbot accuracy for a given assessment. Additionally, on four assessment questions, like the one shown here, the correct answer depends on a unique combination of client attributes. While in this example the correct answer is that the client can deduct transportation costs for medical appointments, if the client attributes were instead "Age: Adult (22-59)" and "Healthcare/Disability status: Not legally disabled," the correct answer would be that the client is ineligible to deduct medical expenses. Our approach tests the effect of correct and incorrect suggestions for both versions of the question.
  • Figure 2: (A) Caseworker participant-level accuracy for any chatbot on average ("Average chatbot") and by chatbot quality: low (53-73% accurate), medium (80-93% accurate), and high (96-100% accurate). We assign these qualitative labels of chatbot quality to specific accuracy ranges to ensure sample sizes are reasonably balanced across groups: 26 caseworkers saw low-quality chatbots, 17 saw medium-quality chatbots, and 51 saw high-quality chatbots. Overall, caseworker accuracy improves with chatbot suggestions by 21 percentage points, regardless of chatbot quality. However, high-quality chatbots lead to the largest improvements in accuracy. (B) Human accuracy increases alongside chatbot accuracy up to a point, resulting in an AI underreliance plateau around $80\%$ accuracy. The dashed horizontal line shows accuracy from participants in the control group, which is roughly comparable to the worst-performing chatbot. The black line illustrates human accuracy if respondents followed $100\%$ of the chatbot suggestions. We include the 95% confidence interval for each estimate in light gray. Due to small sample sizes for some chatbot accuracy conditions, we estimate 95% confidence intervals using the Student's t-distribution.
  • Figure 3: Incorrect chatbot suggestions lead to the largest reductions in caseworker accuracy on easy and medium questions relative to accuracy in the control group, wherein no chatbot suggestions are seen. At the same time, correct chatbot suggestions can be beneficial, particularly on hard questions where they increase caseworker accuracy by 45 percentage points on average. Question difficulty appears on the x-axis and caseworker question-level accuracy on the y-axis. We denote 95% confidence intervals with error bars using cluster-robust standard errors. We define question-level accuracy as the percentage of correct caseworker responses by chatbot suggestion type: incorrect, correct, or no suggestions (purple, violet, and yellow in the figure). To improve the comparison of our estimates, we only include questions that ever appear in an assessment with an incorrect chatbot suggestion (35 questions in total). These questions appear to be harder on average for caseworkers in the control group, hence the control group accuracy is lower compared to the participant-level estimates in Figure 2.
  • Figure 4: Median caseworker time to answer each question is similar with and without chatbot suggestions, and decreases over the course of the assessment. The position in the survey (from 1 to 45) is on the x-axis. The median question duration is on the y-axis. We use the median question duration, as large outliers may skew results. The line color indicates whether participants saw LLM suggestions (treatment group) or no LLM suggestions (control group). Additionally, we include a trend line with 95% confidence intervals. Trends are similar for caseworkers who saw chatbot suggestions and those who did not, as demonstrated by the overlapping lines.
  • Figure S1: An example from the SNAP QC Error Viewer, an application that visualizes quality control errors in SNAP cases based on nationally representative data. This view allows us to examine the kinds of errors that frequently occur and potential reasons for the error (such as "Agency failed to verify required information").
  • ...and 27 more figures
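
The Figure 1 and Figure 2 captions relate the number of incorrect suggestions in an assessment to aggregate chatbot accuracy and to the low/medium/high quality labels. The arithmetic can be sketched as follows. This is a minimal illustration, not the authors' code: the 45-question assessment length is an assumption inferred from the survey positions (1 to 45) in the Figure 4 caption, and the function names `chatbot_accuracy` and `quality_label` are hypothetical.

```python
from fractions import Fraction

# Assumption: each assessment has 45 questions, inferred from the
# survey positions (1 to 45) mentioned in the Figure 4 caption.
N_QUESTIONS = 45


def chatbot_accuracy(n_incorrect: int, n_questions: int = N_QUESTIONS) -> Fraction:
    """Aggregate accuracy of a simulated chatbot that gives
    n_incorrect wrong suggestions out of n_questions."""
    if not 0 <= n_incorrect <= n_questions:
        raise ValueError("n_incorrect must be between 0 and n_questions")
    return Fraction(n_questions - n_incorrect, n_questions)


def quality_label(accuracy: Fraction) -> str:
    """Map an accuracy to the quality ranges given in the Figure 2 caption.
    Accuracies that fall between the stated ranges are left unlabeled."""
    pct = round(accuracy * 100)  # whole-number percentage
    if 53 <= pct <= 73:
        return "low"
    if 80 <= pct <= 93:
        return "medium"
    if 96 <= pct <= 100:
        return "high"
    return "unlabeled"


if __name__ == "__main__":
    for n_incorrect in (21, 9, 2, 0):
        acc = chatbot_accuracy(n_incorrect)
        print(n_incorrect, f"{float(acc):.1%}", quality_label(acc))
```

Under the 45-question assumption, 21 incorrect suggestions correspond to 24/45, which rounds to the 53% reported for the least accurate chatbot, and 0 incorrect suggestions correspond to 100%.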