Latent Point Collapse on a Low Dimensional Embedding in Deep Neural Network Classifiers

Luigi Sbailò; Luca Ghiringhelli

TL;DR

This work targets robust, discriminative latent representations in deep classifiers by introducing latent point collapse (LPC): a regularization that adds a strong $L_2$ penalty, weighted by a coefficient $\gamma$, on the penultimate-layer latent vector $\mathbf{z}$ to the usual cross-entropy objective, so that the penalty and the classification loss pull in opposite directions. As $\gamma$ grows, latent representations from the same class converge to a single point on a fixed-radius shell, which enforces Lipschitz continuity and improves both robustness to input perturbations and feature separability. The approach is demonstrated with low-dimensional linear penultimate layers, where it produces a binary-like latent encoding and convergence toward neural-collapse-like geometry, while remaining compatible with margin-based losses and information bottleneck (IB) interpretations. LPC requires minimal architectural changes and can be combined with existing regularizers for further gains.

Abstract

The configuration of latent representations plays a critical role in determining the performance of deep neural network classifiers. In particular, the emergence of well-separated class embeddings in the latent space has been shown to improve both generalization and robustness. In this paper, we propose a method to induce the collapse of latent representations belonging to the same class into a single point, which enhances class separability in the latent space while enforcing Lipschitz continuity in the network. We demonstrate that this phenomenon, which we call \textit{latent point collapse}, is achieved by adding a strong $L_2$ penalty on the penultimate-layer representations and is the result of a push-pull tension developed with the cross-entropy loss function. In addition, we show the practical utility of applying this compressing loss term to the latent representations of a low-dimensional linear penultimate layer. The proposed approach is straightforward to implement and yields substantial improvements in discriminative feature embeddings, along with remarkable gains in robustness to input perturbations.
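The objective described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `lpc_loss`, the linear classifier `W`, `b`, the batch averaging of the penalty, and the value of `gamma` are all assumptions made for the example.

```python
import numpy as np

def lpc_loss(z, W, b, y, gamma):
    """Cross-entropy on logits z W^T + b plus gamma times the mean squared L2 norm of z.

    z: (N, d) penultimate-layer latents, W: (C, d), b: (C,), y: (N,) integer labels.
    Averaging the penalty over the batch is an assumption of this sketch.
    """
    logits = z @ W.T + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(y)), y].mean()
    penalty = gamma * (z ** 2).sum(axis=1).mean()
    return ce + penalty, ce, penalty

# Synthetic inputs, for illustration only.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
W = rng.normal(size=(3, 4))
b = np.zeros(3)
y = rng.integers(0, 3, size=8)
total, ce, penalty = lpc_loss(z, W, b, y, gamma=10.0)
```

Returning the two terms separately makes it easy to monitor the balance between the cross-entropy and the compression term as `gamma` changes.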

Paper Structure (26 sections, 4 theorems, 49 equations, 5 figures, 4 tables)

Introduction
Contributions
Related Works
Method
Binary encoding
Information bottleneck
Experiments
Latent Point Collapse
Robustness and generalization
Discussion
Conclusion
Acknowledgements
Derivation of Latent Point Collapse
Detailed Radial Analysis Toward Equilibrium
Behavior for large $\|\boldsymbol{z}\|$.
...and 11 more sections

Key Result

Lemma 1

Under Assumption \ref{assump:separation} (class separation), there exists $\delta>0$ such that [...]. Let $\alpha = e^{-\delta} < 1$. Then, for $i\neq \overline{y}$, [...], and therefore [...]. Hence $p_{\overline{y}}\approx 1$ and $p_i\approx 0$ for $i\neq \overline{y}$ in the terminal phase.
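As a numerical illustration of this kind of softmax approximation, the sketch below checks a standard bound: if the logit of the true class exceeds every other logit by at least $\delta$, then each wrong-class probability is at most $\alpha = e^{-\delta}$ and the true-class probability is at least $1/(1+(C-1)\alpha)$. Reading the separation assumption as a logit margin is an assumption of this example, and the names `softmax`, `margin_bounds`, `a`, and `C` are hypothetical.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # shift by the max for numerical stability
    return e / e.sum()

def margin_bounds(num_classes, delta):
    """Bounds implied by a logit margin of at least delta (assumed form of separation)."""
    alpha = np.exp(-delta)
    p_wrong_max = alpha                                   # p_i <= alpha for i != y
    p_true_min = 1.0 / (1.0 + (num_classes - 1) * alpha)  # p_y >= 1 / (1 + (C - 1) alpha)
    return p_wrong_max, p_true_min

# Synthetic logits with a margin of exactly delta over the largest competitor.
rng = np.random.default_rng(0)
C, delta, y = 10, 5.0, 3
a = rng.normal(size=C)
a[y] = np.delete(a, y).max() + delta
p = softmax(a)
p_wrong_max, p_true_min = margin_bounds(C, delta)
```

Both bounds follow from $p_i = e^{a_i}/\sum_j e^{a_j} \le e^{a_i - a_{\overline{y}}}$, so they tighten exponentially as the margin grows.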

Figures (5)

Figure 1: Visualization of the LPC phenomenon with increasing regularization coefficient $\gamma$. (1) We assume class separation in the terminal phase of training (TPT), meaning that the latent representations of different classes are linearly separable. At low values of $\gamma$, representations are still pushed towards the origin but remain spread around a distance $R$ from it. (2) As $\gamma$ increases, both the radius $R$ and the spread begin to decrease. (3) At high $\gamma$, points collapse to distinct locations while maintaining class separation. The arrows illustrate the competing forces: the blue arrows represent the cross-entropy term driving class separation from the origin, tending to increase the norm of latent representations; the red arrows indicate the compression force ($\gamma\|\mathbf{z}\|^2$) pulling points inward; and the green arrows show the cross-entropy effects pushing different classes apart along the shell, thus promoting convergence within each class. This graphical illustration was created using synthetic data to demonstrate the development of the LPC phenomenon, which is described in detail in App. \ref{App:latent_point_collapse}.
Figure 2: Graphical illustration of the dynamics leading to the emergence of a latent binary encoding. The three images provide a qualitative representation of the training process, where the scalar $\gamma$ is progressively increased. The plots in the images represent histograms of the latent representations at a specific node of the linear penultimate layer. In the first image, the relatively low value of $\gamma$ constrains all values close to the origin, but the volume remains large enough for the network to differentiate between different classes. As $\gamma$ increases, all latent values are drawn closer to the origin, as depicted in the second image, making it increasingly difficult for the network to discriminate between elements of different classes. Consequently, the network is forced to find a more stable solution through numerical optimization, placing all elements belonging to the same class in the neighborhood of one of two points. In the two distributions shown in the third image, each of the two peaks may contain elements from several classes, but all elements of a given class are confined to a single peak. Through numerical optimization, these two peaks eventually converge to single points, positioned opposite to each other with respect to the origin, as illustrated in the third image; this outcome is facilitated by the linear layer. The red (green) arrow represents the net effect of the binary encoding (cross-entropy) loss. In relation to Fig. \ref{fig:latent_collapse}, this figure empirically demonstrates that the points of collapse are located on some vertices of a hypercube defined by the dimensions of the penultimate layer, which explains the binary structure observed in this figure.
Figure 3: Each point represents a different training instance, and points are taken from all evaluated architectures. As shown in the plot, there is a strong correlation between the class separation ratio $\mathcal{R}$ and the network’s robustness to input perturbations. The Pearson correlation coefficients are 0.97 for CIFAR10 and 0.985 for CIFAR100, respectively.
Figure 4: Log-likelihood scores, standard deviations, and weighted peak distance of bimodal Gaussian mixture models fitted on each dimension of the penultimate layer using all values of the training set. From top to bottom, the quantities described as $\overline{\ell}$, $\overline{\sigma}$, and $\overline{\mu}$ in Eqs. \ref{Eq:score_likelihood}, \ref{Eq:score_stds}, and \ref{Eq:score_peaks} are computed for the LPC, LPC-Wide, LPC-Narrow, and LinPen architectures. We can see that the latent representations in the penultimate layers featuring $L_2$ loss are well represented by two Gaussians with increasingly small standard deviations, while this is not observed for the LinPen architectures.
Figure 5: Metrics used to evaluate convergence towards Neural Collapse (NC). In the upper figure, we examine a renormalized version of the NC1 property. The normalization is based on the number of nodes in the penultimate layer, to ensure a fair comparison across models whose penultimate layers have different dimensions. The dashed lines are drawn at the average epoch at which training reaches convergence, which demonstrates that most of the training was performed in the TPT. Below, we present metrics demonstrating convergence to an ETFS, utilizing the same parameters as those outlined in \cite{neural_collapse}.
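The per-dimension bimodal fit behind Figure 4 can be illustrated with a small expectation-maximization routine. This is a sketch under stated assumptions: the exact definitions of $\overline{\ell}$, $\overline{\sigma}$, and $\overline{\mu}$ (including the weighting of the peak distance) are not reproduced here, so the code reports plain averages, and the names `fit_bimodal_gmm` and `bimodality_scores` as well as the synthetic data are hypothetical.

```python
import numpy as np

def _log_joint(x, w, mu, sigma):
    # log(w_k * N(x | mu_k, sigma_k)) for both components, shape (N, 2)
    return (-0.5 * ((x[:, None] - mu) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2 * np.pi) + np.log(w))

def fit_bimodal_gmm(x, n_iter=200, eps=1e-8):
    """EM for a two-component 1D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])   # initialise the two means at the extremes
    sigma = np.full(2, x.std() + eps)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample
        lj = _log_joint(x, w, mu, sigma)
        resp = np.exp(lj - np.logaddexp(lj[:, 0], lj[:, 1])[:, None])
        # M-step: update weights, means, and standard deviations
        nk = resp.sum(axis=0) + eps
        w = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + eps
    lj = _log_joint(x, w, mu, sigma)
    mean_loglik = np.logaddexp(lj[:, 0], lj[:, 1]).mean()
    return w, mu, sigma, mean_loglik

def bimodality_scores(Z):
    """Average the per-dimension fit statistics over the columns of Z (shape (N, d))."""
    fits = [fit_bimodal_gmm(Z[:, j]) for j in range(Z.shape[1])]
    return {
        "mean_loglik": float(np.mean([f[3] for f in fits])),
        "mean_std": float(np.mean([f[2].mean() for f in fits])),
        # Unweighted distance between the two peaks (the weighting is not reproduced).
        "mean_peak_distance": float(np.mean([abs(f[1][0] - f[1][1]) for f in fits])),
    }

# Synthetic "binary-like" latents: each dimension clusters tightly around -1 or +1.
rng = np.random.default_rng(0)
Z = np.stack([np.concatenate([rng.normal(-1, 0.05, 500), rng.normal(1, 0.05, 500)])
              for _ in range(3)], axis=1)
scores = bimodality_scores(Z)
```

On latents with a binary structure, the fitted standard deviations are small relative to the distance between the two peaks; a unimodal dimension would instead produce overlapping components.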

Theorems & Definitions (4)

Lemma 1: Softmax Approximation in the Terminal Phase
Lemma 2: Radial Flow
Lemma 3: Volume Confinement via Angular Deviations
Theorem A.1: Latent Point Collapse
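
The push-pull balance behind the radial flow and the collapse result can be made concrete with a short stationarity calculation. This is a heuristic sketch, not the statement of Lemma 2 or Theorem A.1: it assumes a linear classifier with logits $W\mathbf{z}+\mathbf{b}$, treats a single latent vector $\mathbf{z}$ with label $\overline{y}$ as a free variable, and holds $W$ fixed; the symbols $W$, $\mathbf{b}$, $\mathbf{e}_{\overline{y}}$ (one-hot label vector), and $\mathbf{z}^{*}$ are introduced here for illustration.

```latex
\begin{align}
\mathcal{L}(\mathbf{z}) &= -\log p_{\overline{y}}(\mathbf{z}) + \gamma\|\mathbf{z}\|^2,
\qquad \mathbf{p}(\mathbf{z}) = \operatorname{softmax}(W\mathbf{z} + \mathbf{b}), \\
\nabla_{\mathbf{z}}\mathcal{L} &= W^{\top}\bigl(\mathbf{p} - \mathbf{e}_{\overline{y}}\bigr) + 2\gamma\,\mathbf{z}, \\
\mathbf{z}^{*} &= \frac{1}{2\gamma}\, W^{\top}\bigl(\mathbf{e}_{\overline{y}} - \mathbf{p}(\mathbf{z}^{*})\bigr), \\
\|\mathbf{z}^{*}\| &\le \frac{\|W\|_2\,\|\mathbf{e}_{\overline{y}} - \mathbf{p}\|}{2\gamma}
\le \frac{\|W\|_2\,(1 - p_{\overline{y}})}{\sqrt{2}\,\gamma}.
\end{align}
```

The last inequality uses $\sum_{i\neq\overline{y}} p_i^2 \le (1-p_{\overline{y}})^2$. The first gradient term is the cross-entropy push and the second is the inward pull of the penalty; for fixed $W$, the stationary radius shrinks at least as fast as $1/\gamma$. During training $W$ and $\mathbf{p}$ change as well, so this only indicates the direction of the effect.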
