Localizing Memorization in SSL Vision Encoders

Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch

TL;DR

This paper tackles the question of where memorization occurs inside self-supervised vision encoders. It introduces two practical localization tools, LayerMem for per-layer memorization and UnitMem for per-unit memorization, both computable in a forward pass without labels. Through extensive experiments on CNNs and vision transformers across diverse SSL frameworks and datasets, the authors find that memorization grows with depth but spans the entire encoder, with many units memorizing individual data points, especially for atypical samples, and that in ViTs memorization is concentrated in fully connected layers. They also show differential privacy can reduce unit memorization and demonstrate potential gains for targeted fine-tuning and pruning based on memorization signals. Overall, LayerMem and UnitMem offer actionable insights into memorization dynamics and practical pathways to safer, more efficient SSL fine-tuning and compression.

Abstract

Recent work on studying memorization in self-supervised learning (SSL) suggests that even though SSL encoders are trained on millions of images, they still memorize individual data points. While effort has been put into characterizing the memorized data and linking encoder memorization to downstream utility, little is known about where the memorization happens inside SSL encoders. To close this gap, we propose two metrics for localizing memorization in SSL encoders on a per-layer (LayerMem) and per-unit basis (UnitMem). Our localization methods are independent of the downstream task, do not require any label information, and can be performed in a forward pass. By localizing memorization in various encoder architectures (convolutional and transformer-based) trained on diverse datasets with contrastive and non-contrastive SSL frameworks, we find that (1) while SSL memorization increases with layer depth, highly memorizing units are distributed across the entire encoder, (2) a significant fraction of units in SSL encoders experiences surprisingly high memorization of individual data points, which is in contrast to models trained under supervision, (3) atypical (or outlier) data points cause much higher layer and unit memorization than standard data points, and (4) in vision transformers, most memorization happens in the fully-connected layers. Finally, we show that localizing memorization in SSL has the potential to improve fine-tuning and to inform pruning strategies.
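
The abstract states that the per-unit metric needs no labels and only forward-pass activations, but it does not give a formula. As a rough illustration only, the sketch below computes a selectivity-style score per unit: it compares the unit's mean activation on its most-activating data point against its mean activation on all other points. This index is an assumption made for illustration and may differ from the paper's actual UnitMem definition; the function name `unit_selectivity`, the nested-list input layout, and the `eps` stabilizer are all hypothetical.

```python
from statistics import fmean


def unit_selectivity(acts, eps=1e-12):
    """Illustrative per-unit score (assumed form, not the paper's definition).

    acts[i][a][u] is the non-negative activation of unit u for
    augmentation a of data point i, collected in a forward pass.
    Returns one score per unit, near 1 when a single data point
    dominates the unit and near 0 when all points activate it equally.
    """
    n_points = len(acts)
    if n_points < 2:
        raise ValueError("need at least two data points")
    n_units = len(acts[0][0])

    scores = []
    for u in range(n_units):
        # Mean activation of unit u for each data point, averaged over augmentations.
        mu = [fmean(aug[u] for aug in point) for point in acts]
        mu_max = max(mu)
        # Mean activation over all remaining data points.
        mu_rest = (sum(mu) - mu_max) / (n_points - 1)
        scores.append((mu_max - mu_rest) / (mu_max + mu_rest + eps))
    return scores


# Toy input: 3 data points, 2 augmentations each, 2 units.
# Unit 0 fires only for the first point; unit 1 fires equally for all points.
toy_acts = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
toy_scores = unit_selectivity(toy_acts)
```

On the toy input, the score for unit 0 is close to 1 and the score for unit 1 is close to 0, matching the intuition that a unit responding to a single data point is a candidate for memorizing it.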

Paper Structure

This paper contains 72 sections, 16 equations, 13 figures, 34 tables.

Figures (13)

  • Figure 1: Insights into UnitMem. We train a ResNet9 encoder with SimCLR: (a) Different datasets, including SVHN, CIFAR10, and STL10. We report the UnitMem of the last convolutional layer (conv4_2); (b) Comparing alignment between SSLMem and UnitMem on CIFAR10. Data points with higher general memorization (SSLMem) tend to experience higher UnitMem; (c) Using different strengths of privacy protection according to DP during training on CIFAR10 with ViT-Base.
  • Figure 2: UnitMem and ClassMem for SL and SSL.
  • Figure 3: Significantly more (fewer) units memorize data points rather than classes in SSL (SL). We measure ClassMem vs. UnitMem for 10000 samples from CIFAR100, with 100 random samples per class. The i-th column corresponds to the i-th convolutional layer of ResNet9, which has 8 convolutional layers; the 1st row shows SSL and the 2nd row shows SL. The red diagonal line denotes $y=x$.
  • Figure 4: Average UnitMem of layer 8 over training.
  • Figure 5: UnitMem with and without augmentations.
  • ...and 8 more figures