
A Differentiable Rank-Based Objective For Better Feature Learning

Krunoslav Lehman Pavasovic, David Lopez-Paz, Giulio Biroli, Levent Sagun

TL;DR

This work introduces difFOCI, a differentiable, rank-based objective that extends the dependence coefficients of Chatterjee and of Azadkia and Chatterjee to machine learning. By constructing a differentiable surrogate $T_{n,\beta}$ and offering vector-based (vec) and neural network parameterizations, the method unifies feature selection, feature learning, and regularization for fairness and robustness against spurious correlations. The approach achieves state-of-the-art or competitive performance in synthetic and real-world settings, improves worst-group accuracy under domain shift, and demonstrates debiasing capabilities without sacrificing predictive power. Overall, difFOCI provides a versatile framework for end-to-end learning that leverages rank-based dependence to guide feature learning and fairness-aware modeling.

Abstract

In this paper, we leverage existing statistical methods to better understand feature learning from data. We do so by modifying the model-free variable selection method Feature Ordering by Conditional Independence (FOCI), introduced in \cite{azadkia2021simple}. While FOCI is based on a non-parametric coefficient of conditional dependence, we introduce a parametric, differentiable approximation of that coefficient. With this approximate coefficient, we present a new algorithm called difFOCI, which is applicable to a wider range of machine learning problems thanks to its differentiable nature and learnable parameters. We present difFOCI in three contexts: (1) as a variable selection method with baseline comparisons to FOCI, (2) as a trainable model parametrized with a neural network, and (3) as a generic, widely applicable neural network regularizer, one that improves feature learning with better management of spurious correlations. We evaluate difFOCI on increasingly complex problems ranging from basic variable selection in toy examples to saliency map comparisons in convolutional networks. We then show how difFOCI can be incorporated in the context of fairness to facilitate classification without relying on sensitive data.
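
To make the idea of a differentiable approximation concrete, here is a minimal sketch. It assumes the unconditional Azadkia-Chatterjee statistic $T_n(Y, Z) = \sum_i (n \min\{R_i, R_{M(i)}\} - L_i^2) / \sum_i L_i (n - L_i)$, where $M(i)$ is the index of the nearest neighbor of $Z_i$, and it assumes the relaxation replaces the hard nearest neighbor with softmax weights at inverse temperature `beta`. The function name `soft_T` and this exact form of the relaxation are illustrative assumptions, not necessarily the paper's definition of the surrogate. Plain Python is used for readability; in practice one would write the same expression in an autodiff framework so that gradients flow through `z`.

```python
import math
import random

def soft_T(y, z, beta):
    """Softmax-relaxed T_n(Y, Z) (illustrative sketch, not the paper's code).

    y: list of floats; z: list of feature vectors (lists of floats).
    As beta grows, the softmax weights approach the hard nearest neighbor.
    """
    n = len(y)
    r = [sum(yj <= yi for yj in y) for yi in y]  # R_i = #{j : Y_j <= Y_i}
    l = [sum(yj >= yi for yj in y) for yi in y]  # L_i = #{j : Y_j >= Y_i}
    num = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Logits -beta * ||z_i - z_j|| over j != i, shifted by the max for stability.
        logits = [-beta * math.dist(z[i], z[j]) for j in others]
        m = max(logits)
        w = [math.exp(a - m) for a in logits]
        total = sum(w)
        # Soft replacement for min(R_i, R_{M(i)}).
        soft_min = sum(wk * min(r[i], r[j]) for wk, j in zip(w, others)) / total
        num += n * soft_min - l[i] ** 2
    den = sum(li * (n - li) for li in l)  # zero only if Y is constant
    return num / den

# Usage: Y determined by Z versus Y independent of Z.
rng = random.Random(0)
z = [[rng.random()] for _ in range(200)]
y_dependent = [zi[0] for zi in z]
y_independent = [rng.random() for _ in range(200)]
print(soft_T(y_dependent, z, beta=1e6), soft_T(y_independent, z, beta=1e6))
```

Because the only dependence on `z` is through distances inside a softmax, the expression is smooth in `z` for any finite `beta`, which is what allows `z` to be the output of a learnable map.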

Paper Structure

This paper contains 67 sections, 6 theorems, 14 equations, 5 figures, 19 tables, 5 algorithms.

Key Result

Theorem 1

\cite{chatterjee2020original} If $Y$ is not almost surely a constant, then as $n \rightarrow \infty$, $\xi_n(X, Y)$ converges almost surely to the deterministic limit $\xi(X, Y)$.
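
As an illustration of the statistic in Theorem 1, the sketch below computes Chatterjee's $\xi_n(X, Y)$ in its general, tie-aware form: sort the pairs by $X$, let $r_i$ be the number of $j$ with $Y_j \le Y_i$ and $l_i$ the number of $j$ with $Y_j \ge Y_i$, and set $\xi_n = 1 - n \sum_i |r_{i+1} - r_i| / (2 \sum_i l_i (n - l_i))$. Without ties this reduces to $1 - 3 \sum_i |r_{i+1} - r_i| / (n^2 - 1)$. The function name `xi_n` and the `seed` argument are illustrative choices, and the quadratic-time rank computation is kept for clarity rather than speed.

```python
import random

def xi_n(x, y, seed=0):
    """Chatterjee's rank correlation xi_n(X, Y) for two equal-length lists."""
    n = len(x)
    rng = random.Random(seed)
    # Order the sample by X, breaking ties in X uniformly at random.
    order = sorted(range(n), key=lambda i: (x[i], rng.random()))
    ys = [y[i] for i in order]
    r = [sum(b <= a for b in ys) for a in ys]  # r_i = #{j : Y_j <= Y_i}
    l = [sum(b >= a for b in ys) for a in ys]  # l_i = #{j : Y_j >= Y_i}
    num = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    den = 2 * sum(li * (n - li) for li in l)   # zero only if Y is constant
    return 1 - num / den

# Usage: a noiseless non-monotone function of X versus independent noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
print(xi_n(x, [xi * xi for xi in x]))               # Y is a function of X
print(xi_n(x, [rng.gauss(0, 1) for _ in range(500)]))  # Y independent of X
```

The denominator vanishes when $Y$ is constant, which matches the theorem's requirement that $Y$ is not almost surely a constant.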

Figures (5)

  • Figure 1: Synthetic dataset experiment, detailed in Sec. \ref{sec:preliminary_synthetic_study}. Out of 240 total features, our vec-\ref{dF1} selects three informative, yet diverse features (corresponding to norms $0.27$, $0.23$, and $0.18$).
  • Figure 2: ResNet-50 \cite{he2016deep} saliency maps using the ERM \cite{vapnik2006estimation} loss, DRO \cite{sagawa2019distributionally} with standard regularization (early stopping and $\ell_2$) or difFOCI. Without difFOCI, the models rely heavily on the background (spurious features). With difFOCI, the main focus shifts to the relevant features (the bird). Further samples are shown in Appendix \ref{appx:sec_E}.
  • Figure 3: Left: Mean and standard deviation across 50 random initializations. All expressions yield values significantly greater than zero. Right: Development of the first five parameters in Toy Exp 1.
  • Figure 4: Five randomly selected samples along with their corresponding saliency maps. In some cases, ERM and gDRO do not rely on the background (as seen in the last row), but they do for others. In these instances, difFOCI reduces the reliance on the background, which can be observed clearly in rows 1, 2, and 3, and to a lesser extent in row 4.
  • Figure 5: Five randomly selected samples along with their corresponding saliency maps. It is evident that difFOCI has a more pronounced effect in reducing the reliance on the background for ERM compared to DRO. In most cases, the reliance is significantly reduced for ERM (e.g., rows 2, 3, 4, and 5). For DRO, the improvement is less pronounced, with potential minor improvements in rows 3 and 4.

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • proof