I Need Help! Evaluating LLM's Ability to Ask for Users' Support: A Case Study on Text-to-SQL Generation

Cheng-Kuang Wu, Zhi Rui Tam, Chao-Chung Wu, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen

Abstract

This study explores the proactive ability of LLMs to seek user support. We propose metrics to evaluate the trade-off between performance improvements and user burden, and investigate whether LLMs can determine when to request help under varying information availability. Our experiments show that without external feedback, many LLMs struggle to recognize their need for user support. The findings highlight the importance of external signals and provide insights for future research on improving support-seeking strategies. Source code: https://github.com/appier-research/i-need-help

Paper Structure (23 sections, 5 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Overview of our experiments on text-to-SQL. LLMs struggle to determine when they need help based solely on the instruction ($x$) or their output ($\hat{y}$). They require external feedback, such as the execution results ($\hat{r}$) from the database, to outperform random baselines.
  • Figure 2: Performance curves of gpt-3.5-turbo-0125. Curves of other LLMs are shown in Appendix \ref{app:curves}.
  • Figure 3: Performance curves of WizardCoder-Python-34B-V1.0.
  • Figure 4: Performance curves of Llama-3-70b-chat-hf.
  • Figure 5: Performance curves of deepseek-coder-33b-instruct.
  • ...and 4 more figures