Using LLMs to Investigate Correlations of Conversational Follow-up Queries with User Satisfaction

Hyunwoo Kim, Yoonseo Choi, Taehyun Yang, Honggu Lee, Chaneon Park, Yongju Lee, Jin Young Kim, Juho Kim

TL;DR

This work investigates how follow-up queries in conversational search signal user satisfaction and intent. The authors construct an 18-theme taxonomy along two axes (user motivations and follow-up actions) from in-lab and real-world data, then build an LLM-powered classifier (73% accuracy) to label logs at scale and quantify how follow-up behaviors correlate with satisfaction. They find that patterns like Clarifying Queries and Reacting to Response tend to align with lower satisfaction, while other patterns offer finer-grained insights into user needs and evaluation. The study contributes a scalable framework for automatic evaluation and realistic user simulation of conversational search experiences, with implications for personalization and system-guided information seeking.

Abstract

With large language models (LLMs), conversational search engines shift how users retrieve information from the web by enabling natural conversations to express their search intents over multiple turns. Users' natural conversation embodies rich but implicit signals of their search intents and their evaluation of search results, which can be used to understand user experience with the system. However, it is underexplored how and why users ask follow-up queries to continue conversations with conversational search engines and how these follow-up queries signal users' satisfaction. From a qualitative analysis of 250 conversational turns from an in-lab user evaluation of Naver Cue:, a commercial conversational search engine, we propose a taxonomy of 18 follow-up query patterns in conversational search, comprising two major axes: (1) users' motivations behind continuing conversations (N = 7) and (2) actions of follow-up queries (N = 11). Compared to the existing literature on query reformulations, we uncovered a new set of motivations and actions behind follow-up queries, including asking for subjective opinions or providing natural language feedback on the engine's responses. To analyze conversational search logs with our taxonomy in a scalable and efficient manner, we built an LLM-powered classifier (73% accuracy). With our classifier, we analyzed 2,061 conversational tuples collected from real-world usage logs of Cue: and examined how the conversation patterns from our taxonomy correlate with satisfaction. Our initial findings suggest some signals of dissatisfaction in follow-up queries, such as Clarifying Queries, Excluding Condition, and Substituting Condition. We envision that our approach could contribute to automated evaluation of the conversational search experience by providing satisfaction signals and grounds for realistic user simulations.
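
To make the final analysis step more concrete, the following is a minimal sketch of how classifier-labeled turns might be grouped by theme and compared on satisfaction. It is not the authors' pipeline. The 1-to-5 satisfaction scale, the `Other` placeholder theme, the function names, and every score in `labeled_turns` are assumptions invented for illustration; only the three named themes come from the abstract.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (theme, satisfaction) pairs, as a classifier might emit them.
# The scores are invented toy values on an assumed 1-5 scale, NOT paper data.
labeled_turns = [
    ("Clarifying Queries", 2),
    ("Clarifying Queries", 3),
    ("Excluding Condition", 2),
    ("Substituting Condition", 3),
    ("Other", 4),  # "Other" is a placeholder theme, not part of the taxonomy
    ("Other", 5),
    ("Other", 4),
    ("Other", 5),
]


def mean_satisfaction_by_theme(turns):
    """Return {theme: mean satisfaction score} for the labeled turns."""
    scores = defaultdict(list)
    for theme, score in turns:
        scores[theme].append(score)
    return {theme: mean(values) for theme, values in scores.items()}


def deviation_from_overall(turns):
    """Return {theme: theme mean minus the overall mean across all turns}."""
    overall = mean(score for _, score in turns)
    by_theme = mean_satisfaction_by_theme(turns)
    return {theme: value - overall for theme, value in by_theme.items()}


if __name__ == "__main__":
    for theme, delta in sorted(deviation_from_overall(labeled_turns).items()):
        print(f"{theme}: {delta:+.2f} relative to the overall mean")
```

A negative deviation marks a theme whose turns score below the overall average in this toy data. A real analysis would also need a significance test across themes, which this sketch omits.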

Paper Structure (18 sections, 7 figures, 3 tables)

This paper contains 18 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Frequency distribution of our taxonomy from the real-world data.
  • Figure 2: The co-occurrences of themes between Axes 1 and 2.
  • Figure 3: Distribution of satisfaction score by each theme.
  • Figure 4: Result of pairwise comparison of satisfaction score by each theme. The colored squares signify statistically significant differences in satisfaction scores between the themes.
  • Figure 5: Distribution of relevance score by each theme.
  • ...and 2 more figures