I Need Help! Evaluating LLM's Ability to Ask for Users' Support: A Case Study on Text-to-SQL Generation

Cheng-Kuang Wu; Zhi Rui Tam; Chao-Chung Wu; Chieh-Yen Lin; Hung-yi Lee; Yun-Nung Chen

Abstract

This study explores the proactive ability of LLMs to seek user support. We propose metrics to evaluate the trade-off between performance improvements and user burden, and investigate whether LLMs can determine when to request help under varying information availability. Our experiments show that without external feedback, many LLMs struggle to recognize their need for user support. The findings highlight the importance of external signals and provide insights for future research on improving support-seeking strategies. Source code: https://github.com/appier-research/i-need-help

Paper Structure (23 sections, 5 equations, 9 figures, 4 tables)

Introduction
Formulation for Seeking Support
General Setup
Evaluation
Methods for Seeking Support
Experiments
Dataset
Implementation
Main Results
Discussion
Analysis on the Delta-Burden Curves
LLMs without Access to Log Probabilities
Related Work
Conclusion
Limitations
...and 8 more sections

Figures (9)

Figure 1: Overview of our experiments on text-to-SQL. LLMs struggle to determine when they need help based solely on the instruction ($x$) or their output ($\hat{y}$). They require external feedback, such as the execution results ($\hat{r}$) from the database, to outperform random baselines.
Figure 2: Performance curves of gpt-3.5-turbo-0125. Curves of other LLMs are shown in Appendix \ref{app:curves}.
Figure 3: Performance curves of WizardCoder-Python-34B-V1.0.
Figure 4: Performance curves of Llama-3-70b-chat-hf.
Figure 5: Performance curves of deepseek-coder-33b-instruct.
...and 4 more figures
