Is Interpretable Machine Learning Effective at Feature Selection for Neural Learning-to-Rank?
Lijun Lyu, Nirmal Roy, Harrie Oosterhuis, Avishek Anand
TL;DR
This paper tackles interpretability and efficiency in neural learning-to-rank (LTR) by adapting six interpretable ML feature-selection methods and introducing a novel global variant, G-L2X, for embedded feature selection. It reveals substantial feature redundancy in neural LTR benchmarks and shows that TabNet can approximate optimal ranking performance with fewer than 10 features, while G-L2X can dramatically reduce feature retrieval costs with modest performance loss. Globally oriented methods are generally robust to incomplete input, making them attractive for production settings, whereas some local methods falter when inputs are truncated. Overall, the work demonstrates that carefully adapted interpretable ML techniques can both illuminate model behavior and yield practical efficiency gains in neural LTR, with code made publicly available for reproducibility.
Abstract
Neural ranking models have become increasingly popular for real-world search and recommendation systems in recent years. Unlike their tree-based counterparts, neural models are much less interpretable. That is, it is very difficult to understand their inner workings and answer questions like "how do they make their ranking decisions?" or "what document features do they find important?" This is particularly disadvantageous since interpretability is highly important for real-world systems. In this work, we explore feature selection for neural learning-to-rank (LTR). In particular, we investigate six widely used methods from the field of interpretable machine learning (ML) and introduce our own modification, to select the input features that are most important to the ranking behavior. To understand whether these methods are useful for practitioners, we further study whether they contribute to efficiency enhancement. Our experimental results reveal a large feature redundancy in several LTR benchmarks: the local selection method TabNet can achieve optimal ranking performance with fewer than 10 features; the global methods, particularly our G-L2X, require slightly more selected features, but exhibit higher potential in improving efficiency. We hope that our analysis of these feature selection methods will bring the fields of interpretable ML and LTR closer together.
