"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing

Johannes Kirmayr, Raphael Wennmacher, Khanh Huynh, Lukas Stappen, Elisabeth André, Florian Alt

TL;DR

The paper tackles how agentic LLM-based in-car assistants should communicate during long-running, multi-step tasks. It uses a controlled mixed-methods design (N=45) comparing Planning & Results (PR) intermediate feedback to No Intermediate (NI) final-only delivery, across stationary and driving contexts. Quantitative results show PR enhances perceived speed, user experience, and trust, while reducing task load, with effect sizes such as $d_z=1.01$ for perceived speed and $d_z=0.38$ for trust; qualitative data reveal adaptive verbosity strategies—high initial transparency followed by reductions as reliability grows, with situational re-expansion for novel or high-stakes tasks. The findings translate into design implications for feedback timing and content, advocating: (i) frequent, content-rich intermediate updates during long computations; (ii) adaptive verbosity gated by demonstrated reliability; and (iii) situational controls to balance transparency and distraction in dual-task driving and other primary-task contexts.

Abstract

Agentic AI assistants that autonomously perform multi-step tasks raise open questions for user experience: how should such systems communicate progress and reasoning during extended operations, especially in attention-critical contexts such as driving? We investigate feedback timing and verbosity from agentic LLM-based in-car assistants through a controlled, mixed-methods study (N=45) comparing planned steps and intermediate results feedback against silent operation with final-only response. Using a dual-task paradigm with an in-car voice assistant, we found that intermediate feedback significantly improved perceived speed, trust, and user experience while reducing task load - effects that held across varying task complexities and interaction contexts. Interviews further revealed user preferences for an adaptive approach: high initial transparency to establish trust, followed by progressively reducing verbosity as systems prove reliable, with adjustments based on task stakes and situational context. We translate our empirical findings into design implications for feedback timing and verbosity in agentic assistants, balancing transparency and efficiency.

Paper Structure (75 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Study apparatus: (1) speaker (voice user interface), (2) center display (graphical user interface), (3a) driving simulation in the form of a lane-keeping task, (3b) mouse used to correct the continuous lateral drift in the driving simulation.
  • Figure 2: Quantitative study design: Each participant completed 8 tasks across the 2x2x2 conditions.
  • Figure 3: Quantitative study tasks: User requests for the two tasks with different durations, along with the assistant's final response for the No Intermediate (NI) feedback system and the assistant's updates and final response for the Planning & Results (PR) feedback system. Because the final answer is longer for the NI feedback system, it was started earlier (at 18s instead of 20s, and at 33s instead of 35s, for the two tasks respectively) so that the last output ends at the same time for both systems. Note that at the start of a task in both systems (NI and PR), a clicking sound, accompanied by the visual message "I am planning," is played to indicate that the user request has been registered.
  • Figure 4: Task order and measurement timing: each participant performed all 8 tasks in hierarchically counterbalanced and randomized order. Dependent variables were measured (white boxes with black border) either after each task (perceived speed), after a 2-task block (UEQ+: user experience, NASA RTLX: task load), or after all 4 tasks per feedback system (S-TIAS: trust).
  • Figure 5: Scores for the dependent variables, with post-hoc t-tests contrasting the feedback timing systems (NI vs. PR) collapsed across context and duration conditions. All measures show a significant effect in favor of PR feedback: perceived speed shows a large effect ($p<.001$, ***, $95\%$ CI $[0.90, 1.54]$, $d_z=1.01$), user trust shows a small effect ($p=0.042$, *, $95\%$ CI $[0.01, 0.60]$, $d_z=0.38$), the user experience KPI value shows a moderate effect ($p=0.002$, **, $95\%$ CI $[0.06, 0.24]$, $d_z=0.54$), and task load shows a small effect ($p=0.034$, *, $95\%$ CI $[-8.54, -0.35]$, $d_z=-0.26$).
  • ...and 1 more figure
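
The Figure 5 caption reports within-subject effect sizes as $d_z$. As a reading aid, the following is a minimal sketch of how $d_z$ is conventionally computed for a paired contrast: the mean of the per-participant differences divided by their sample standard deviation, with the paired t statistic following as $t = d_z\sqrt{n}$. The function name `paired_effect_size` and the ratings are hypothetical and purely illustrative; they are not the study's data, and the paper's own analysis pipeline may differ (for example, in corrections applied to the effect size).

```python
import math
import statistics


def paired_effect_size(scores_a, scores_b):
    """Return (d_z, t) for paired samples, contrasting condition b against a.

    d_z = mean(differences) / sample SD(differences)
    t   = d_z * sqrt(n), the paired t statistic with n - 1 degrees of freedom
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    if len(scores_a) < 2:
        raise ValueError("need at least two pairs")

    # One difference per participant (b minus a).
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; d_z is undefined")

    d_z = mean_diff / sd_diff
    t = d_z * math.sqrt(n)
    return d_z, t


# Made-up ratings for six hypothetical participants (NOT data from the paper).
ni_ratings = [3, 4, 2, 5, 3, 4]  # hypothetical scores under NI feedback
pr_ratings = [5, 5, 4, 6, 4, 6]  # hypothetical scores under PR feedback

d_z, t_stat = paired_effect_size(ni_ratings, pr_ratings)
print(f"d_z = {d_z:.2f}, t({len(ni_ratings) - 1}) = {t_stat:.2f}")
```

With this sign convention (PR minus NI), a positive $d_z$ means higher scores under PR, and a negative value, as reported for task load, means lower scores under PR.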