Decoupled Entropy Minimization

Jing Ma, Hanlin Li, Xiang Xiang

TL;DR

This paper addresses limitations of Entropy Minimization (EM) by reformulating the conditional entropy $H(\mathbf{z})$ into a decoupled sum of a Cluster Aggregation Driving Factor (CADF) and a Gradient Mitigation Calibrator (GMC), expressed as $H(\mathbf{z}) = T_\tau(\mathbf{z}) + Q_\alpha(\mathbf{z})$. It identifies reward collapse and easy-class bias arising from the coupled EM and analyzes them through the CADF/GMC lens. To overcome these issues, it introduces Adaptive Decoupled Entropy Minimization (AdaDEM): normalizing the CADF reward with $\delta = \| - \partial T(\mathbf{z})/\partial \mathbf{z} \|_1$ and replacing GMC with Marginal Entropy Calibrator (MEC), a hyperparameter-free, dynamically estimated prior that mitigates easy-class bias. Across SSL, TTA, UDA, and RL tasks, AdaDEM achieves superior performance and robustness compared to DEM*, demonstrating EM’s potential when decoupled and adaptively regularized.
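
The decoupling can be checked numerically. The sketch below is a minimal illustration, not the authors' code. It assumes $\tau = 1$ and $\alpha = 1$, and it assumes the split is the standard softmax identity $H(\mathbf{z}) = -\sum_k p_k z_k + \log \sum_k e^{z_k}$, with the first term standing in for $T$ and the second for $Q$; the paper's exact definitions may differ. The function names `cadf_term` and `gmc_term` are hypothetical.

```python
import math


def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(z):
    # Shannon entropy of softmax(z): H = -sum_k p_k log p_k.
    return -sum(p * math.log(p) for p in softmax(z) if p > 0.0)


def cadf_term(z):
    # Assumed stand-in for T(z): the negative expected logit, -sum_k p_k z_k.
    return -sum(p * v for p, v in zip(softmax(z), z))


def gmc_term(z):
    # Assumed stand-in for Q(z): the log-partition term, log sum_k exp(z_k).
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))


logits = [2.0, 0.5, -1.0]
gap = abs(entropy(logits) - (cadf_term(logits) + gmc_term(logits)))
print(f"H = {entropy(logits):.6f}, |H - (T + Q)| = {gap:.2e}")
```

Under this assumed split, the gradient of the second term with respect to $z_k$ is exactly $p_k$, which is consistent with the description of GMC as penalizing classes in proportion to their predicted probabilities.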

Abstract

Entropy Minimization (EM) is beneficial to reducing class overlap, bridging domain gap, and restricting uncertainty for various tasks in machine learning, yet its potential is limited. To study the internal mechanism of EM, we reformulate and decouple the classical EM into two parts with opposite effects: cluster aggregation driving factor (CADF) rewards dominant classes and prompts a peaked output distribution, while gradient mitigation calibrator (GMC) penalizes high-confidence classes based on predicted probabilities. Furthermore, we reveal the limitations of classical EM caused by its coupled formulation: 1) reward collapse impedes the contribution of high-certainty samples in the learning process, and 2) easy-class bias induces misalignment between output distribution and label distribution. To address these issues, we propose Adaptive Decoupled Entropy Minimization (AdaDEM), which normalizes the reward brought from CADF and employs a marginal entropy calibrator (MEC) to replace GMC. AdaDEM outperforms DEM*, an upper-bound variant of classical EM, and achieves superior performance across various imperfectly supervised learning tasks in noisy and dynamic environments.
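
Reward collapse can be illustrated with the closed-form gradient of softmax entropy with respect to the logits, $\partial H / \partial z_j = -p_j (\log p_j + H)$. The sketch below is an assumption-laden toy, not the paper's experiment: it uses a hypothetical 10-way logit vector in which one class leads the rest by a margin, and reports the L1 norm of the gradient as the margin grows.

```python
import math


def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_grad(z):
    # Closed-form gradient of H(softmax(z)) w.r.t. z: dH/dz_j = -p_j (log p_j + H).
    p = softmax(z)
    h = -sum(pk * math.log(pk) for pk in p if pk > 0.0)
    return [-pk * (math.log(pk) + h) if pk > 0.0 else 0.0 for pk in p]


def grad_l1(z):
    # L1 norm of the entropy gradient, used here as the size of the learning signal.
    return sum(abs(g) for g in entropy_grad(z))


# Toy setting (assumed): one dominant class, nine equal competitors.
for margin in (4.0, 8.0, 16.0):
    z = [margin] + [0.0] * 9
    print(f"p_max = {max(softmax(z)):.6f}   ||dH/dz||_1 = {grad_l1(z):.3e}")
```

In this toy setting the gradient norm shrinks toward zero as the maximum probability approaches 1, so the most certain samples contribute the least to an entropy-minimization update. This is the behavior the abstract refers to as reward collapse.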

Paper Structure

This paper contains 41 sections, 1 theorem, 22 equations, 20 figures, 15 tables, 1 algorithm.

Key Result

Proposition A.1

The valid value of temperature $\tau$ in Decoupled Entropy Minimization is $0 < \tau \le 2 / \alpha$ where $\alpha > 0$.

Figures (20)

  • Figure 1: EM is decoupled into two parts with opposite effects: CADF and GMC. DEM* softens the model's prediction via temperature $\tau$ and scales GMC via weight $\alpha$, searching for the optimal $(\tau^*, \alpha^*)$ to maximize classical EM's performance. AdaDEM normalizes the CADF reward by $\delta$ (the L1-norm of the gradients) to prevent reward collapse, and replaces GMC with the Marginal Entropy Calibrator (MEC, i.e., $\overline{\mathfrak{p}}_{k}^t$) to reduce easy-class bias.
  • Figure 1: (Left) Ablation studies in single-domain and continual TTA tasks. DEM* searches for the optimal hyperparameters $(\tau^*, \alpha^*)$ on a subset of target data. $\Delta$ denotes the performance improvement relative to the baseline. (Right) Learning-rate sensitivity tests show that AdaDEM tolerates a 10x wider range of learning rates than classical EM.
  • Figure 2: (Left: Reward Collapse) We compare the gradient magnitudes of the classical EM for samples with different predicted probabilities, which collapse to $0.0$ when the maximum probabilities approach $1.0$. (Right: Easy-Class Bias) The output distribution of ViT-B/16 after test-time adaptation using classical EM and our AdaDEM on a class-balanced Gaussian-noise-corrupted ImageNet-C benchmark, with all classes sorted by their predicted proportions in descending order.
  • Figure 3: (Left) Reward curves of DEM with varying $\tau$ values for a $10$-way classification task. (Center) The best $\tau$ value positively correlates with the average predicted probability of source models on target data. (Right) Detailed TTA results using the optimal $\tau$ across $15$ target domains, with exact values sourced from Fig. 3 (Center). "NoAdapt" denotes the baseline using fixed source model parameters without adaptation, hence its performance remains consistent for both single-domain and continual TTA tasks. Values in brackets indicate the corresponding axis range.
  • Figure 4: (Left) Reward curves of DEM with varying $\alpha$ values for a $10$-way classification task. (Center) Effects of different $\alpha$ values on static and dynamic target data distribution shifts of single-domain and continual tasks. (Right) Detailed TTA results across $15$ target domains, using the optimal $\alpha=1.0$ for single-domain TTA tasks and $\alpha=1.3$ for continual TTA tasks. These $\alpha$ values are selected based on the ablation results in Fig. 4 (Center). The definitions of "NoAdapt" and values in brackets are consistent with Fig. 3 (Right).
  • ...and 15 more figures

Theorems & Definitions (2)

  • Proposition A.1
  • proof