Shadow loss: Memory-linear deep metric learning for efficient training
Alif Elham Khan, Mohammad Junayed Hasan, Humayra Anjum, Nabeel Mohammed
TL;DR
Shadow Loss tackles the memory bottleneck in deep metric learning by replacing the $O(S\cdot D)$ loss buffer with a memory-linear $O(S)$ surrogate that computes similarity via scalar projections onto the anchor direction. It remains proxy-free and parameter-free while preserving the triplet structure. It also has a stable, 2-Lipschitz gradient in the normalized anchor, which leads to faster convergence and tighter embeddings. Empirically, it improves Recall@K and silhouette scores across fine-grained, large-scale, standard, and medical imaging benchmarks, and it requires roughly $1.5$–$2\times$ fewer epochs under identical backbones and mining. Shadow Loss also reduces peak VRAM by about $17$–$21\%$. By reusing batch dot-products and decoupling discriminative power from embedding dimensionality, it enables memory-efficient training on edge devices and large-scale systems.
Abstract
Deep metric learning objectives (e.g., triplet loss) require storing and comparing high-dimensional embeddings, so the per-batch loss buffer scales as $O(S\cdot D)$, where $S$ is the number of samples in a batch and $D$ is the feature dimension. This limits training on memory-constrained hardware. We propose Shadow Loss, a proxy-free, parameter-free objective that measures similarity via scalar projections onto the anchor direction, reducing the loss-specific buffer from $O(S\cdot D)$ to $O(S)$ while preserving the triplet structure. We analyze its gradients, provide a Lipschitz continuity bound, and show that Shadow Loss penalizes trivial collapse, which stabilizes optimization. Across fine-grained retrieval (CUB-200, CARS196), large-scale product retrieval (Stanford Online Products, In-Shop Clothes), and standard/medical benchmarks (CIFAR-10/100, Tiny-ImageNet, HAM-10K, ODIR-5K), Shadow Loss consistently outperforms recent objectives (Triplet, Soft-Margin Triplet, Angular Triplet, SoftTriple, Multi-Similarity). It also converges in $\approx 1.5$–$2\times$ fewer epochs under identical backbones and mining, and it improves representation separability, as measured by higher silhouette scores. The design is architecture-agnostic and vectorized for efficient implementation. By decoupling discriminative power from embedding dimensionality and reusing batch dot-products, Shadow Loss enables memory-linear training and faster convergence, making deep metric learning practical on both edge and large-scale systems.
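The projection idea described above can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function name `shadow_style_triplet_loss`, the distance (the gap between the anchor's norm and each scalar projection), the hinge form, and the `margin` value. It should be read as one plausible instance of a projection-based triplet objective, not as the authors' implementation.

```python
import numpy as np


def shadow_style_triplet_loss(anchor, positive, negative, margin=0.2, eps=1e-12):
    """Projection-based triplet loss (illustrative sketch, assumed form).

    anchor, positive, negative: float arrays of shape (S, D).
    Returns the mean hinge loss and the two per-sample projection vectors.
    """
    # Anchor norms, shape (S,). eps guards against division by zero.
    a_norm = np.linalg.norm(anchor, axis=1) + eps

    # Row-wise dot products computed without an (S, D) temporary,
    # then divided by the anchor norm: scalar projections, shape (S,).
    proj_p = np.einsum("sd,sd->s", anchor, positive) / a_norm
    proj_n = np.einsum("sd,sd->s", anchor, negative) / a_norm

    # ASSUMPTION: distance is the gap between the anchor's own length
    # and the projection of the other sample onto the anchor direction.
    d_ap = np.abs(a_norm - proj_p)
    d_an = np.abs(a_norm - proj_n)

    # Triplet hinge applied to scalar distances only.
    per_triplet = np.maximum(d_ap - d_an + margin, 0.0)
    return per_triplet.mean(), (proj_p, proj_n)


# Toy batch with S = 2 triplets and D = 2 dimensions.
a = np.array([[1.0, 0.0], [0.0, 2.0]])
loss_easy, (pp, pn) = shadow_style_triplet_loss(a, a, -a)   # positive == anchor
loss_hard, _ = shadow_style_triplet_loss(a, -a, a)          # roles swapped
print(loss_easy, loss_hard, pp.shape)
```

In this sketch the hinge only ever sees vectors of shape `(S,)`, whereas a Euclidean triplet loss would typically form `(S, D)` difference vectors before reducing them. That is the sense in which the loss-specific buffer drops from $O(S\cdot D)$ to $O(S)$.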
