Why and When LLM-Based Assistants Can Go Wrong: Investigating the Effectiveness of Prompt-Based Interactions for Software Help-Seeking

Anjali Khurana; Hari Subramonyam; Parmit K Chilana

TL;DR

The paper investigates how non-expert users interact with prompt-based LLM assistance for software help by comparing SoftAIBot, a prompt-guided, domain-contextualized assistant, against Baseline ChatGPT on PowerPoint and Excel tasks. The study used a within-subject design with 16 participants and mixed methods (expert-rated task metrics and follow-up interviews). SoftAIBot's outputs were more accurate and relevant, yet this did not translate into improved task completion or user-perceived usefulness, and participants often trusted coherent but incorrect guidance. The authors highlight the gap between improved LLM outputs and user mental models, the tendency toward overtrust, and the need for transparent, explainable interfaces that help users evaluate prompt-based interactions. They argue for integrating visual cues, confidence measures, and multimodal guidance to bridge human-AI gaps in feature-rich software contexts, and they advocate interdisciplinary collaboration to design accountable, user-centered LLM help systems.

Abstract

Large Language Model (LLM) assistants, such as ChatGPT, have emerged as potential alternatives to search methods for helping users navigate complex, feature-rich software. LLMs use vast training data from domain-specific texts, software manuals, and code repositories to mimic human-like interactions, offering tailored assistance, including step-by-step instructions. In this work, we investigated LLM-generated software guidance through a within-subject experiment with 16 participants and follow-up interviews. We compared a baseline LLM assistant with an LLM optimized for particular software contexts, SoftAIBot, which also offered guidelines for constructing appropriate prompts. We assessed task completion, perceived accuracy, relevance, and trust. Surprisingly, although SoftAIBot outperformed the baseline LLM, our results revealed no significant difference in LLM usage and user perceptions with or without prompt guidelines and the integration of domain context. Most users struggled to understand how the prompt's text related to the LLM's responses and often followed the LLM's suggestions verbatim, even if they were incorrect. This resulted in difficulties when using the LLM's advice for software tasks, leading to low task completion rates. Our detailed analysis also revealed that users remained unaware of inaccuracies in the LLM's responses, indicating a gap between their lack of software expertise and their ability to evaluate the LLM's assistance. With the growing push for designing domain-specific LLM assistants, we emphasize the importance of incorporating explainable, context-aware cues into LLMs to help users understand prompt-based interactions, identify biases, and maximize the utility of LLM assistants.

Paper Structure (24 sections, 6 figures)

Introduction
Related Work
Software Help-seeking evolution
LLM use for Task-Based Assistance
Prompt-based interactions
Method: Controlled Experiment and Follow-up Interviews
Participants
Design and Implementation of SoftAIBot and Baseline ChatGPT
SoftAIBot Intervention (GPT-4 with Prompt Guidelines and Software Documentation)
Baseline ChatGPT intervention (ChatGPT plus)
Choice of Application and Tasks
Study Design and Procedure
Data Collection and Analysis
Results
Task Completion, Accuracy and Relevance of LLM Assistance
...and 9 more sections

Figures (6)

Figure 1: SoftAIBot integrates domain context via documentation and offers prompt guidelines to construct better prompts: (a) allows users to type in the prompt text and submit it; (b) generates prompt suggestions in response to a user’s text (in this case, also shows a sample transformed query that users can directly use); (c) formats the response as step-by-step instructions optimized for particular software contexts, in this case PowerPoint. For the contrasting Baseline LLM response, see Figure 2.
Figure 2: Overview of a sample user task with the PowerPoint application: (a) Users were asked to look up and use instructions from the LLM intervention. In this case, Baseline ChatGPT mimics the existing ChatGPT Plus based on GPT-4, where users can type their prompt in the textbox and the LLM provides assistance for a variety of tasks; (b) Users were asked to use LLM assistance to develop the shown project timeline in Microsoft PowerPoint, which is visual and animated.
Figure 3: Overview of participants’ responses to the post-task questionnaire. A Pearson chi-squared test showed no significant difference on any metric between the two LLM interventions for the Excel and PowerPoint tasks. Despite low completion rates and low task accuracy, the majority of users perceived that they obtained accurate (a) and relevant (b) assistance from both LLM interventions. Still, the majority of participants (c) found it difficult to apply the LLM assistance and instructions in the software application to complete their task; (d) participants overall did not find it difficult to craft prompts, although a few indicated that they struggled to find the correct words for PowerPoint tasks that were more visual and interactive; (e) although expert ratings showed that users did not finish the task accurately, most users believed that it was easier for them to finish the task using both forms of LLM assistance; (f) the majority of users trusted both LLMs, which was surprising because expert ratings showed that both LLMs frequently provided inaccurate assistance.
Figure 4: Evidence of LLM hallucination (P15): In response to P15's prompt, “I want you to give me instructions on how to animate a shape that rotates from top to lower middle side and then come back up almost like a zigzag.”, Baseline ChatGPT hallucinated a "Zigzag" menu option (highlighted in red) that does not exist in the software application.
Figure 5: Unsuccessful prompting using a keyword-based approach with Baseline ChatGPT: (a) P07 prompted the LLM using keywords interpreted from the task, struggled to get a relevant response, and went through several rounds of clarification with the LLM; (b) Eventually, the user could not even get started and failed to perform the task in the software application (P07).
...and 1 more figures
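
The Figure 3 caption reports a Pearson chi-squared test comparing questionnaire responses across the two LLM interventions. As a minimal sketch of how such a comparison works for a single metric, the following Python computes the statistic and p-value for a 2x2 table using only the standard library. The function name `chi_squared_2x2` and the counts are hypothetical and chosen purely for illustration; they are not the study's data, and the source does not state how the authors binned responses or which software they used.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table.

    No continuity correction is applied. Every expected cell count must be
    non-zero, otherwise the division below fails.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, the upper-tail probability of the
    # chi-squared distribution equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# HYPOTHETICAL counts (not from the paper): rows are interventions,
# columns are "agreed" vs. "did not agree" on one questionnaire item.
hypothetical_counts = [
    [11, 5],  # SoftAIBot (assumed values)
    [10, 6],  # Baseline ChatGPT (assumed values)
]

statistic, p_value = chi_squared_2x2(hypothetical_counts)
print(f"chi2 = {statistic:.3f}, p = {p_value:.3f}")
```

With counts this similar the p-value is far above 0.05, which is the pattern the caption describes as "no significant difference". With samples as small as 16 participants, expected cell counts can be low, so an exact test may be a safer choice in practice.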
