Can Offline Metrics Measure Explanation Goals? A Comparative Survey Analysis of Offline Explanation Metrics in Recommender Systems
André Levi Zanon, Marcelo Garcia Manzato, Leonardo Rocha
TL;DR
The paper asks whether offline metrics can measure explanation goals in recommender systems. It studies path-based explanations, which connect interacted and recommended items via shared attributes, and systematically examines how the choice of attributes and interacted items affects user perception across three KG-agnostic algorithms and six recommender systems. The study has two stages: an offline evaluation of path metrics on the MovieLens and LastFM KG datasets, followed by an online user study. The results show only partial alignment between the two: perceived attribute diversity strongly influences engagement, while both popularity and diversity affect transparency, trust, and persuasiveness. The findings highlight a gap between current offline metrics and actual user perception, and the authors propose guidelines and directions for developing offline metrics that better reflect explanation goals and user understanding.
Abstract
In Recommender Systems (RSs), explanations help users understand why items are recommended and can enhance a system's transparency, persuasiveness, engagement, and trust, which are known as explanation goals. However, evaluating the effectiveness of explanation algorithms offline remains challenging because explanation goals are inherently subjective. We first conducted a rapid literature review, which revealed that algorithms are often assessed using anecdotal evidence (offering convincing examples) or using metrics that do not align with human perception. Motivated by these results, we investigated whether the selection of item attributes and interacted items affects explanation goals in path-based explanations, which connect interacted and recommended items through shared attributes (such as genres). We used metrics that measure the diversity and popularity of attributes and the recency of item interactions to evaluate explanations from three state-of-the-art agnostic algorithms across six recommender systems. We then conducted an online user study to compare users' perceptions of explanation goals with the offline metrics. Our findings indicate that engagement is sensitive to users' perceptions of diversity in explanations, whereas transparency, trust, and persuasiveness are influenced by perceptions of both popularity and diversity. However, offline metrics require refinement to align more closely with explanation goals and user understanding.
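To make the three metric families named above concrete, the following is a minimal sketch of how attribute diversity, attribute popularity, and interaction recency could be computed over a set of explanation paths. It is not the paper's implementation: the `Path` structure, the function names, the normalizations (distinct-attribute ratio, frequency relative to the most frequent attribute, min-max scaled timestamps), and the toy data are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """Hypothetical explanation path: interacted item -> shared attribute -> recommended item."""
    interacted_item: str
    attribute: str
    recommended_item: str


def attribute_diversity(paths):
    """Share of distinct attributes among the paths (assumed definition)."""
    if not paths:
        return 0.0
    return len({p.attribute for p in paths}) / len(paths)


def attribute_popularity(paths, attribute_counts):
    """Mean attribute frequency, scaled by the most frequent attribute (assumed definition)."""
    if not paths:
        return 0.0
    max_count = max(attribute_counts.values())
    return sum(attribute_counts[p.attribute] / max_count for p in paths) / len(paths)


def interaction_recency(paths, timestamps):
    """Mean min-max scaled timestamp of the interacted items (assumed definition)."""
    if not paths:
        return 0.0
    oldest, newest = min(timestamps.values()), max(timestamps.values())
    span = newest - oldest
    if span == 0:
        return 1.0  # all interactions happened at the same time
    return sum((timestamps[p.interacted_item] - oldest) / span for p in paths) / len(paths)


# Toy data, invented purely for illustration.
paths = [
    Path("i1", "sci-fi", "r1"),
    Path("i2", "sci-fi", "r2"),
    Path("i3", "drama", "r3"),
    Path("i1", "sci-fi", "r4"),
]
attribute_counts = {"sci-fi": 50, "drama": 100}  # items per attribute in the catalog
timestamps = {"i1": 0, "i2": 50, "i3": 100}      # interaction times for one user

print(attribute_diversity(paths))                     # 0.5
print(attribute_popularity(paths, attribute_counts))  # 0.625
print(interaction_recency(paths, timestamps))         # 0.375
```

Scores of this kind describe properties of the explanation paths only. As the abstract notes, whether they track users' perceived transparency, trust, persuasiveness, or engagement is a separate question that the user study addresses.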
