DSL: Understanding and Improving Softmax Recommender Systems with Competition-Aware Scaling
Bucher Sahyouni, Matthew Vowels, Liqun Chen, Simon Hadfield
TL;DR
DSL targets instability in softmax-based recommender training caused by a single global temperature and uniformly sampled negatives. It introduces two complementary branches: a within-example κ-branch that reweights negatives using hardness and item–item similarity, and a competition-aware (CA) branch that forms a top competitor slate and assigns per-example temperatures based on competition intensity, all while normalising to avoid logit drift. The approach yields distributionally robust improvements, supported by a KL-DRO interpretation and metric-aligned gradient estimates, with empirical gains across multiple datasets and backbones, especially under distribution shifts and for tail items. The work preserves the Softmax Loss backbone while reshaping the competition geometry to focus learning on the most informative substitutes, improving both accuracy and robustness in implicit-feedback recommender systems.
Abstract
Softmax Loss (SL) is increasingly adopted for recommender systems (RS) because it has demonstrated better performance, robustness and fairness. Yet in implicit-feedback settings, a single global temperature and equal treatment of uniformly sampled negatives can lead to brittle training, because sampled sets vary in how relevant or informative their competitors are. The loss sharpness that is optimal for a user-item pair with one set of negatives can be suboptimal or destabilising for another pair with different negatives. We introduce Dual-scale Softmax Loss (DSL), which infers effective sharpness from the sampled competition itself. DSL adds two complementary branches to the log-sum-exp backbone. First, it reweights negatives within each training instance using hardness and item–item similarity. Second, it adapts a per-example temperature from the competition intensity over a constructed competitor slate. Together, these components preserve the geometry of SL while reshaping the competition distribution across negatives and across examples. Over several representative benchmarks and backbones, DSL yields substantial gains over strong baselines, with improvements over SL exceeding $10\%$ in several settings and averaging $6.22\%$ across datasets, metrics, and backbones. Under out-of-distribution (OOD) popularity shift, the gains are larger, with an average improvement of $9.31\%$ over SL. We further provide a theoretical distributionally robust optimisation (DRO) analysis, which demonstrates how DSL reshapes the robust payoff and the KL deviation for ambiguous instances. This helps explain the empirically observed improvements in accuracy and robustness.
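To make the two scales concrete, the following is a minimal NumPy sketch of a sampled log-sum-exp loss that accepts either a single global temperature or a per-example one, plus optional per-negative log-weights. It is an illustration under stated assumptions, not the DSL formulation itself: the negatives-only form of the log-sum-exp, the function names `softmax_loss` and `hardness_log_weights`, and the parameter `beta` are all hypothetical choices made here. The weighting uses hardness only, omits item–item similarity, and is normalised to mean one per example so that reweighting alone does not shift the overall logit scale.

```python
import numpy as np

def softmax_loss(pos, neg, tau=1.0, log_w=None):
    """Per-example sampled softmax loss, log sum_j w_j * exp((s_j - s_pos) / tau).

    pos:   (B,)   scores of the positive items
    neg:   (B, N) scores of the sampled negatives
    tau:   scalar (global) or (B,) (per-example) temperature
    log_w: optional (B, N) log-weights on negatives (hypothetical extension)
    """
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    tau = np.broadcast_to(np.asarray(tau, dtype=float), pos.shape)
    z = (neg - pos[:, None]) / tau[:, None]
    if log_w is not None:
        z = z + log_w
    # Numerically stable log-sum-exp over the negatives.
    m = z.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(z - m).sum(axis=1))

def hardness_log_weights(neg, beta=1.0):
    """Hypothetical hardness weights: N * softmax(beta * score), in log space.

    The weights average to 1 within each example; beta = 0 gives uniform weights.
    """
    neg = np.asarray(neg, dtype=float)
    z = beta * neg
    z = z - z.max(axis=1, keepdims=True)
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return log_softmax + np.log(neg.shape[1])

# Two examples with identical scores but different temperatures.
pos = np.array([1.0, 1.0])
neg = np.array([[0.5, 2.0], [0.5, 2.0]])
global_tau_loss = softmax_loss(pos, neg, tau=1.0)
per_example_loss = softmax_loss(pos, neg, tau=np.array([1.0, 0.5]))
weighted_loss = softmax_loss(pos, neg, tau=1.0,
                             log_w=hardness_log_weights(neg, beta=1.0))
```

With identical scores, halving the temperature for the second example sharpens its loss around the hardest negative, which is the sense in which one global temperature cannot suit every sampled set.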
