Session-level Normalization and Click-through Data Enhancement for Session-based Evaluation

Haonan Chen, Zhicheng Dou, Jiaxin Mao

TL;DR

The paper addresses a mismatch in session-based evaluation: existing metrics score each query separately and assume that every query in the session is needed and that the query order is fixed, which does not match how users actually behave. It introduces Normalized U-Measure (NUM), which treats a session as a single virtual query, normalizes the score of the actual session against that of an ideal session, and derives session-level relevance labels from click-through data. The design rests on two guiding assumptions, A1 and A2. NUM correlates more strongly with user satisfaction and is more intuitive than existing session-based metrics, and ablation studies support the value of session-level normalization, reformulation-time penalties, and click-through enhancement. The work provides a practical framework for offline session evaluation with implicit feedback and offers insights for designing future session-level metrics.

Abstract

Since a user usually has to issue a sequence of queries and examine multiple documents to resolve a complex information need in a search session, researchers have paid much attention to evaluating search systems at the session level rather than the single-query level. Most existing session-level metrics evaluate each query separately and then aggregate the query-level scores using a session-level weighting function. The assumptions behind these metrics are that all queries in the session should be involved and that their order is fixed. However, if a search system could satisfy the user with her first few queries, she might not need any subsequent queries. Besides, in most real-world search scenarios, due to a lack of explicit feedback from real users, we can only leverage some implicit feedback, such as users' clicks, as relevance labels for offline evaluation. Such implicit feedback might differ from the real relevance in a search session, as some documents may be overlooked in an earlier query but identified in later reformulations. To address the above issues, we make two assumptions about session-based evaluation, which explicitly describe an ideal session-search system and how to enhance click-through data in computing session-level evaluation metrics. Based on our assumptions, we design a session-level metric called Normalized U-Measure (NUM). NUM evaluates a session as a whole and utilizes an ideal session to normalize the result of the actual session. Besides, it infers session-level relevance labels based on implicit feedback. Experiments on two public datasets demonstrate the effectiveness of NUM by comparing it with existing session-based metrics in terms of correlation with user satisfaction and intuitiveness. We also conduct ablation studies to explore whether these assumptions hold.

Paper Structure

This paper contains 25 sections, 8 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Illustration of how U-measure constructs a trailtext from a two-query session. Results clicked by the user are marked with red checkmarks, and results labeled as relevant are shaded gray. The right part shows the constructed trailtext, where $s_i$ denotes its $i$-th string.
  • Figure 2: Illustration of NUM. The upper part is the actual session, and the lower part is the ideal session. Results clicked by the user are marked with red checkmarks, and results labeled as relevant are shaded gray. Rank 4 of the first query is labeled relevant even though it is not clicked there, because the same document is clicked in the subsequent query. We treat a session as a virtual query and build a trailtext from it to enable session-level evaluation. For the actual session, the trailtext is constructed from the actual user clicks; for the ideal session, it is constructed from the enhanced session-level labels.
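
The two ideas in Figure 2 (session-level labels from clicks, and normalization against an ideal session) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the function names, the fixed `snippet_len`, `doc_len`, and `reform_cost` values, the patience parameter `L`, and the linear decay `max(0, 1 - pos / L)` are all choices made here for illustration. The sketch also assumes that a document clicked anywhere in the session counts as relevant in every query where it appears, and that the ideal session shows all relevant documents at the top of a single query.

```python
from typing import List, Set


def session_labels(rankings: List[List[str]], clicks: List[Set[str]]) -> List[List[int]]:
    """Assumed click enhancement: a document is relevant wherever it is shown
    if it was clicked in any query of the session."""
    clicked_anywhere = set().union(*clicks)
    return [[int(doc in clicked_anywhere) for doc in ranking] for ranking in rankings]


def decay(pos: float, L: float) -> float:
    """Linear position-based decay over the trailtext (illustrative)."""
    return max(0.0, 1.0 - pos / L)


def num_sketch(rankings: List[List[str]], clicks: List[Set[str]],
               snippet_len: int = 100, doc_len: int = 500,
               reform_cost: int = 200, L: float = 10000.0) -> float:
    """Score of the actual session divided by the score of an ideal session."""
    # Actual session: one trailtext across all queries, with a hypothetical
    # fixed cost added for every reformulation.
    pos, actual, seen = 0, 0.0, set()
    for qi, (ranking, clicked) in enumerate(zip(rankings, clicks)):
        if qi > 0:
            pos += reform_cost
        last = max((i for i, d in enumerate(ranking) if d in clicked), default=-1)
        for doc in ranking[:last + 1]:       # snippets read down to the last click
            pos += snippet_len
            if doc in clicked:
                pos += doc_len               # the clicked document is read
                if doc not in seen:          # gain is counted once per document
                    actual += decay(pos, L)
                    seen.add(doc)

    # Ideal session: every relevant document (by enhanced labels) is read
    # back to back in one query, with no reformulation cost.
    labels = session_labels(rankings, clicks)
    relevant = {d for r, ls in zip(rankings, labels) for d, l in zip(r, ls) if l}
    pos, ideal = 0, 0.0
    for _ in relevant:
        pos += snippet_len + doc_len
        ideal += decay(pos, L)

    return actual / ideal if ideal > 0 else 0.0


# Two-query session mirroring Figure 2: "d" is skipped at rank 4 of the
# first query but clicked at rank 1 of the second.
rankings = [["a", "b", "c", "d"], ["d", "e"]]
clicks = [{"a"}, {"d"}]
labels = session_labels(rankings, clicks)
score = num_sketch(rankings, clicks)
```

In this toy session the score falls slightly below 1, because the user needed a reformulation to reach "d", whereas the ideal session would have surfaced both relevant documents in the first query.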