Shadow loss: Memory-linear deep metric learning for efficient training

Alif Elham Khan; Mohammad Junayed Hasan; Humayra Anjum; Nabeel Mohammed

TL;DR

Shadow Loss tackles the memory bottleneck in deep metric learning by replacing the $O(S\cdot D)$ loss buffer with a memory-linear $O(S)$ surrogate that computes similarity via scalar projections onto the anchor direction. It remains proxy-free and parameter-free while preserving the triplet structure, and it benefits from a 2-Lipschitz, stable gradient in the normalized anchor, leading to faster convergence and tighter embeddings. Empirically, it improves Recall@K and silhouette scores across fine-grained, large-scale, standard, and medical imaging benchmarks, and requires roughly $1.5$–$2\times$ fewer epochs under identical backbones and mining. Additionally, Shadow Loss reduces peak VRAM by about $17$–$21\%$, enabling memory-efficient training on edge devices and large-scale systems by reusing batch dot-products and decoupling discriminative power from embedding dimensionality.

Abstract

Deep metric learning objectives (e.g., triplet loss) require storing and comparing high-dimensional embeddings, making the per-batch loss buffer scale as $O(S\cdot D)$, where $S$ is the number of samples in a batch and $D$ is the feature dimension, thus limiting training on memory-constrained hardware. We propose Shadow Loss, a proxy-free, parameter-free objective that measures similarity via scalar projections onto the anchor direction, reducing the loss-specific buffer from $O(S\cdot D)$ to $O(S)$ while preserving the triplet structure. We analyze gradients, provide a Lipschitz continuity bound, and show that Shadow Loss penalizes trivial collapse for stable optimization. Across fine-grained retrieval (CUB-200, CARS196), large-scale product retrieval (Stanford Online Products, In-Shop Clothes), and standard/medical benchmarks (CIFAR-10/100, Tiny-ImageNet, HAM-10K, ODIR-5K), Shadow Loss consistently outperforms recent objectives (Triplet, Soft-Margin Triplet, Angular Triplet, SoftTriple, Multi-Similarity). It also converges in $\approx 1.5\text{-}2\times$ fewer epochs under identical backbones and mining. Furthermore, it improves representation separability as measured by higher silhouette scores. The design is architecture-agnostic and vectorized for efficient implementation. By decoupling discriminative power from embedding dimensionality and reusing batch dot-products, Shadow Loss enables memory-linear training and faster convergence, making deep metric learning practical on both edge and large-scale systems.
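To make the projection idea concrete, here is a minimal pure-Python sketch of a triplet-style loss computed on scalar projections. It is an illustration under stated assumptions, not the authors' implementation: the distance form $|\,\lVert a\rVert - (a\cdot x)/\lVert a\rVert\,|$, the hinge with a `margin` hyperparameter, and the names `shadow_distance` and `shadow_triplet_loss` are all assumptions inferred from the description above.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def shadow_distance(anchor, other, eps=1e-12):
    """Assumed distance: gap between the anchor's length and the scalar
    projection of `other` onto the anchor direction. Only one scalar per
    sample is kept, rather than a D-dimensional difference vector."""
    norm_a = max(math.sqrt(dot(anchor, anchor)), eps)  # guard against a zero anchor
    projection = dot(anchor, other) / norm_a           # scalar projection onto the anchor
    return abs(norm_a - projection)


def shadow_triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge over projected distances, keeping the triplet structure.
    `margin` is a hypothetical hyperparameter chosen for this sketch."""
    d_pos = shadow_distance(anchor, positive)
    d_neg = shadow_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# An easy triplet: the positive coincides with the anchor and the negative
# points the opposite way, so the hinge is inactive.
print(shadow_triplet_loss([3.0, 4.0], [3.0, 4.0], [-3.0, -4.0]))  # 0.0
```

In a batched setting, the dot products `a·p` and `a·n` would typically come from a single matrix product over the batch, which is one way to read "reusing batch dot-products".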

Paper Structure (14 sections, 19 equations, 2 figures, 6 tables, 1 algorithm)

Introduction
Related work
Proposed methodology
Background: Triplet loss
Triplet selection
Shadow loss
Convergence Analysis
Pseudocode for shadow loss
Experiments
Implementation details
Results
Analysis
Ablation studies
Conclusion

Figures (2)

Figure 1: Shadow Loss vs. Triplet Loss. Shadow Loss measures the distance between the anchor and the projections of the positive and negative samples, whereas Triplet Loss measures the angular distance between them. (a) Training step in which the positive moves toward the anchor and the negative moves away. (b) After a Triplet Loss update, the positive is nearer and the negative farther, but all embeddings stay in their original plane. (c) Shadow Loss first projects the positive and negative onto the anchor’s axis, then draws the positive projection closer and pushes the negative projection away. (d) After the update, the anchor and the projections share the same plane; the positive projection sits close to the anchor, while the negative projection remains distant.
Figure 2: t-SNE embeddings on CIFAR-10. Triplet Loss (left) shows overlapping strands; Shadow Loss (right) yields compact, well-separated clusters, consistent with higher silhouette scores in the main results tables.
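
The silhouette score cited in the Figure 2 caption can be sketched in a few lines. The version below follows the standard definition (for each sample, $s = (b - a)/\max(a, b)$, where $a$ is the mean distance to its own cluster and $b$ is the smallest mean distance to any other cluster) and assumes Euclidean distance. The function names and the toy data are illustrative and do not come from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (standard definition)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 1-D embeddings: compact clusters score near 1, interleaved labels score below 0.
embeddings = [[0.0], [1.0], [10.0], [11.0]]
compact = silhouette_score(embeddings, [0, 0, 1, 1])
interleaved = silhouette_score(embeddings, [0, 1, 0, 1])
print(compact > interleaved)  # True
```

Higher values indicate tighter, better-separated clusters, which is the sense in which the caption links the t-SNE plots to the silhouette results.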
