Is Interpretable Machine Learning Effective at Feature Selection for Neural Learning-to-Rank?

Lijun Lyu, Nirmal Roy, Harrie Oosterhuis, Avishek Anand

TL;DR

This paper tackles interpretability and efficiency in neural learning-to-rank (LTR) by adapting six interpretable ML feature-selection methods and introducing a novel global variant, G-L2X, for embedded feature selection. It reveals substantial feature redundancy in neural LTR benchmarks and shows that TabNet can approximate optimal ranking with fewer than 10 features, while G-L2X can dramatically reduce feature retrieval costs with modest performance loss. Globally oriented methods are generally robust to incomplete input, which makes them attractive for production settings, whereas some local methods falter when inputs are truncated. Overall, the work demonstrates that carefully adapted interpretable ML techniques can both illuminate model behavior and yield practical efficiency gains in neural LTR. The code is publicly available for reproducibility.

Abstract

Neural ranking models have become increasingly popular for real-world search and recommendation systems in recent years. Unlike their tree-based counterparts, neural models are much less interpretable. That is, it is very difficult to understand their inner workings and answer questions like "how do they make their ranking decisions?" or "what document features do they find important?" This is particularly disadvantageous since interpretability is highly important for real-world systems. In this work, we explore feature selection for neural learning-to-rank (LTR). In particular, we investigate six widely used methods from the field of interpretable machine learning (ML) and introduce our own modification to select the input features that are most important to the ranking behavior. To understand whether these methods are useful for practitioners, we further study whether they contribute to efficiency enhancement. Our experimental results reveal a large feature redundancy in several LTR benchmarks: the local selection method TabNet can achieve optimal ranking performance with fewer than 10 features; the global methods, particularly our G-L2X, require slightly more selected features, but exhibit higher potential in improving efficiency. We hope that our analysis of these feature selection methods will bring the fields of interpretable ML and LTR closer together.

Paper Structure (12 sections, 3 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Methods overview, as described in the methods section.
  • Figure 2: Results of three fixed-budget methods applied to scenario 1. The x-axis indicates the pre-specified percentile of selected features ($k$). The shaded area shows the standard deviation over 5 random seeds.
  • Figure 3: Scenario 2. Feature cost (left two panels) and ranking performance (right two panels) under incomplete input. The x-axis indicates the percentage of features present in the input when testing the trained ranking model. Note that this differs from specifying $k$ during training for fixed-budget methods in scenario 1.
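
The captions for Figures 2 and 3 distinguish two ideas: choosing a fixed budget of $k$ percent of features, and feeding a trained ranker an input in which only some features are present. The sketch below illustrates both with NumPy. It is a minimal illustration, not the authors' implementation: the function names, the use of a single global importance vector, the zero-filling of absent features, and the per-feature cost vector are all assumptions made for this example.

```python
import numpy as np


def top_k_percent_mask(importance, k_percent):
    """Boolean mask keeping the top k percent of features by importance.

    Assumption: one global importance score per feature (hypothetical input).
    """
    n_features = importance.shape[0]
    n_keep = max(1, int(round(n_features * k_percent / 100.0)))
    # Sort by descending importance; a stable sort keeps ties in index order.
    keep = np.argsort(-importance, kind="stable")[:n_keep]
    mask = np.zeros(n_features, dtype=bool)
    mask[keep] = True
    return mask


def apply_incomplete_input(features, mask):
    """Simulate incomplete input by zero-filling absent features.

    Assumption: zero is used as the fill value; other choices are possible.
    """
    return np.where(mask, features, 0.0)


def feature_cost(mask, costs):
    """Total retrieval cost of the features that are present."""
    return float(costs[mask].sum())


# Hypothetical toy data: 2 documents, 4 features.
importance = np.array([0.1, 0.9, 0.5, 0.3])
costs = np.array([1.0, 2.0, 3.0, 4.0])
docs = np.array([[1.0, 2.0, 3.0, 4.0],
                 [5.0, 6.0, 7.0, 8.0]])

mask = top_k_percent_mask(importance, k_percent=50)
masked_docs = apply_incomplete_input(docs, mask)
total_cost = feature_cost(mask, costs)
```

In this toy setup the masked documents would be passed to an already trained ranker, and its ranking quality and `total_cost` would be recorded for each percentage of present features.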