Successfully Guiding Humans with Imperfect Instructions by Highlighting Potential Errors and Suggesting Corrections

Lingjun Zhao; Khanh Nguyen; Hal Daumé

TL;DR

This work introduces HEAR, a system that detects hallucinations in language-guided navigation instructions and proposes corrections, paired with a user interface that highlights potential errors and reveals correction options on demand. By combining a hallucination detector and a hallucination-type classifier with synthetic data generation, HEAR can rank corrections and guide humans through long-horizon 3D navigation tasks even when instructions are imperfect. In experiments with 80 human participants, HEAR improves success rate and reduces final-location error, and user engagement with exploration prompts increases task persistence and performance. The results demonstrate that structured uncertainty communication can significantly boost human decision-making in sequential, vision-language tasks and offer a generalizable approach for robust AI-assisted navigation.

Abstract

Language models will inevitably err in situations with which they are unfamiliar. However, by effectively communicating uncertainties, they can still guide humans toward making sound decisions in those contexts. We demonstrate this idea by developing HEAR, a system that can successfully guide humans in simulated residential environments despite generating potentially inaccurate instructions. Diverging from systems that provide users with only the instructions they generate, HEAR warns users of potential errors in its instructions and suggests corrections. This rich uncertainty information effectively prevents misguidance and reduces the search space for users. Evaluation with 80 users shows that HEAR achieves a 13% increase in success rate and a 29% reduction in final location error distance compared to only presenting instructions to users. Interestingly, we find that by offering users possibilities to explore, HEAR motivates them to make more attempts at the task, ultimately leading to a higher success rate. To the best of our knowledge, this work is the first to show the practical benefits of uncertainty communication in a long-horizon sequential decision-making problem.

Paper Structure (41 sections, 1 equation, 10 figures, 4 tables)

Introduction
Related Work
Grounded instruction generation.
Uncertainty communication for human-AI collaboration.
Hallucination detection.
Problem Setting
HEAR: Hallucination Detection and Remedy
Hallucination Detection
Correction Suggestion
Ranking suggestions.
Hallucination type classification.
Dataset Creation
Training data for hallucination detection.
Training data for hallucination-type classification.
Generating sets of candidate corrections.
...and 26 more sections

Figures (10)

Figure 1: HEAR detects errors in a navigation instruction and suggests corrections. It enables humans to avoid being misled and efficiently search the environment, leading to improved performance.
Figure 2: Our hallucination detection model (top) and hallucination-type classification model (bottom). Each model takes a language instruction and a visual route as input and predicts a binary label. For hallucination detection, the label is whether a phrase is a hallucination. For hallucination-type classification, the label is whether a hallucination is intrinsic (needs to be replaced) or extrinsic (needs to be removed). Each model is built on top of a pre-trained vision-language model and is fine-tuned using contrastive learning. The first model is used to decide which phrases to highlight in an instruction, and the two models are combined to score and rank possible corrections.
Figure 3: Performance measured by success rate (SR ↑) and navigation error (DIST ↓), and the number of check-button clicks recorded when human users perform navigation tasks with different assistant systems. HEAR improves user navigation performance and is competitive with the two Oracle systems. The error bars for SR represent 85% confidence intervals. For DIST and Checks, the "x" marks the mean, the line inside a box marks the median, and the box represents the interquartile range. Table \ref{tab:main_result} shows the corresponding results in table format.
Figure 4: Example success and failure cases of HEAR (more in Appendix \ref{app:qualitative_examples}).
Figure 5: Introductory page of the human navigation task. A video instruction is provided.
...and 5 more figures
