Analytical Softmax Temperature Setting from Feature Dimensions for Model- and Domain-Robust Classification

Tatsuhito Hasegawa, Shunsuke Sakai

TL;DR

The paper addresses the challenge of choosing the softmax temperature $T$ in classification by establishing a training-free estimator for the optimal temperature $T^*$ that depends chiefly on the feature-map dimensionality $M$ and on task properties. It introduces a closed-form baseline $T^* \approx \alpha \sqrt{M} + \beta$ and refines it with the task-aware correction terms $\gamma \log(\text{csg}) + \delta \log(\text{cn})$; a batch normalization layer inserted before the output layer stabilizes the relation. Empirical results across CNNs and vision transformers on multiple datasets show that the proposed estimator often outperforms the default $T=1$ and remains robust to architectural and dataset changes, although interactions with label smoothing can attenuate the gains in some cases. The approach offers a practical, training-free way to set the temperature robustly, with implications for generalization and deployment efficiency in deep learning classifiers. Limitations and open directions include the restriction to image classification, possible refinements to the coefficient optimization, and dynamic $T$ schedules during training.
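As a rough illustration of how the closed-form estimate above could be evaluated, the following Python sketch computes $\alpha \sqrt{M} + \beta$ and optionally adds the two logarithmic correction terms. The function name `estimate_temperature` and all coefficient values are hypothetical placeholders, since this summary does not report the fitted coefficients. It is also assumed here that `cn` denotes the number of classes and `csg` a positive task-complexity score.

```python
import math


def estimate_temperature(M, alpha, beta, gamma=0.0, delta=0.0, csg=None, cn=None):
    """Evaluate T* ~ alpha*sqrt(M) + beta (+ gamma*log(csg) + delta*log(cn)).

    All coefficients must be supplied by the caller; none of the values
    used below come from the paper.
    """
    if M <= 0:
        raise ValueError("feature dimensionality M must be positive")
    t = alpha * math.sqrt(M) + beta          # dimension-only baseline
    if csg is not None:
        t += gamma * math.log(csg)           # assumed task-complexity correction
    if cn is not None:
        t += delta * math.log(cn)            # assumed class-count correction
    return t


# Placeholder coefficients, chosen only to exercise the function.
baseline = estimate_temperature(M=512, alpha=1.0, beta=0.0)
corrected = estimate_temperature(
    M=512, alpha=1.0, beta=0.0, gamma=0.5, delta=0.5, csg=2.0, cn=10
)
print(f"baseline T* = {baseline:.3f}, corrected T* = {corrected:.3f}")
```

Because the estimate needs only $M$ and task-level quantities, it can be computed before any training run, which is what makes the scheme training-free.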

Abstract

In deep learning-based classification tasks, the softmax function's temperature parameter $T$ critically influences the output distribution and overall performance. This study presents a novel theoretical insight that the optimal temperature $T^*$ is uniquely determined by the dimensionality of the feature representations, thereby enabling training-free determination of $T^*$. Despite this theoretical grounding, empirical evidence reveals that $T^*$ fluctuates under practical conditions owing to variations in models, datasets, and other confounding factors. To address these influences, we propose and optimize a set of temperature determination coefficients that specify how $T^*$ should be adjusted based on the theoretical relationship to feature dimensionality. Additionally, we insert a batch normalization layer immediately before the output layer, effectively stabilizing the feature space. Building on these coefficients and a suite of large-scale experiments, we develop an empirical formula to estimate $T^*$ without additional training while also introducing a corrective scheme to refine $T^*$ based on the number of classes and task complexity. Our findings confirm that the derived temperature not only aligns with the proposed theoretical perspective but also generalizes effectively across diverse tasks, consistently enhancing classification performance and offering a practical, training-free solution for determining $T^*$.
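To make the role of $T$ concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. It assumes the common convention of dividing the logits by $T$ before normalizing; the exact parameterization used in the paper is not given in this excerpt, and the function name and example logits are illustrative only.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    if T <= 0:
        raise ValueError("temperature T must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                      # illustrative model outputs
sharp = softmax_with_temperature(logits, T=0.5)
default = softmax_with_temperature(logits, T=1.0)
flat = softmax_with_temperature(logits, T=4.0)
print(sharp, default, flat)
```

Under this convention a smaller $T$ concentrates probability on the largest logit and a larger $T$ flattens the distribution, while the ranking of the classes is unchanged.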

Paper Structure

This paper contains 30 sections, 18 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Sample outputs of the softmax function at various temperatures for a fixed model output $\hat{{\bm y}}$.
  • Figure 2: Effect of temperature variation and label smoothing on cross-entropy loss.
  • Figure 3: Outline of the flow of a general deep neural network model with an inserted normalization layer.
  • Figure 4: Change in accuracy relative to $T$ in the CIFAR10 environment using VGG9 with $M = 512$. The results for $T = 256$ and $T = 512$ are below the drawing range.
  • Figure 5: Test accuracies [%] for each temperature parameter in various scenarios (without insertion of the normalization layer). For CIFAR-100@10 only, 10 classes are extracted from the CIFAR-100 superclasses (using only even class numbers) to standardize the number of output units to 10.
  • ...and 6 more figures