Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

David Perera; Victor Letzelter; Théo Mariotte; Adrien Cortés; Mickael Chen; Slim Essid; Gaël Richard

TL;DR

Annealed Multiple Choice Learning (aMCL) addresses ambiguity in conditional prediction by injecting deterministic annealing into MCL. It replaces hard Winner-takes-all with a Boltzmann softmin $q_{T}$ controlled by a temperature schedule $T(t)$ and optimizes a soft distortion $D(q,f)$, guiding each hypothesis toward soft barycenters and allowing phase transitions as $T$ decreases. Theoretical analysis links the dynamics to entropy-constrained optimization and rate-distortion theory, predicting merges at high temperature and splits (phase transitions) into subgroups along the rate-distortion curve $R_x(D^\bullet)$. Empirically, aMCL achieves competitive distortion and SI-SDR on UCI and speech separation benchmarks, with improved robustness to initialization and favorable computational complexity relative to PIT, suggesting strong practical utility for conditional quantization tasks.
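
To make the softmin $q_{T}$ and the soft distortion $D(q,f)$ concrete, here is a minimal sketch. It rests on several assumptions that are not stated in this summary: scalar targets, a squared-error loss, no conditioning input $x$, and a linear cooling schedule that ends at zero. The function names `softmin_weights`, `soft_distortion` and `linear_schedule` are hypothetical and chosen only for illustration.

```python
import math


def softmin_weights(losses, temperature):
    """Boltzmann softmin: q_k is proportional to exp(-loss_k / T)."""
    if temperature <= 0.0:
        # Assumed convention: at T = 0 fall back to hard Winner-takes-all.
        winner = losses.index(min(losses))
        return [1.0 if k == winner else 0.0 for k in range(len(losses))]
    # Shifting by the minimum loss avoids overflow and leaves q unchanged.
    lowest = min(losses)
    unnorm = [math.exp(-(loss - lowest) / temperature) for loss in losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def soft_distortion(hypotheses, targets, temperature):
    """Mean over targets of sum_k q_k * (f_k - y)^2 (squared error assumed)."""
    acc = 0.0
    for y in targets:
        losses = [(f - y) ** 2 for f in hypotheses]
        q = softmin_weights(losses, temperature)
        acc += sum(qk * lk for qk, lk in zip(q, losses))
    return acc / len(targets)


def linear_schedule(step, n_steps, t0):
    """Hypothetical linear cooling from t0 at step 0 down to 0 at n_steps."""
    return t0 * (1.0 - step / n_steps)
```

At a very high temperature the weights are close to uniform, so every hypothesis is pulled toward the same barycenter; as the temperature approaches zero the weights concentrate on the best hypothesis and the hard Winner-takes-all assignment is recovered.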

Abstract

We introduce Annealed Multiple Choice Learning (aMCL), which combines simulated annealing with MCL. MCL is a learning framework that handles ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm through extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.

Paper Structure (41 sections, 10 theorems, 44 equations, 8 figures, 7 tables)

Introduction
Related Work
Annealed Multiple Choice Learning
Winner-takes-all loss and its limitations
Combining deterministic annealing with Multiple Choice Learning
Theoretical analysis
Soft assignation and entropy constraints
Rate-distortion curve
Phase transitions
Experimental validation
UCI datasets
Setup
Results
Application to speech separation
Experimental setting
...and 26 more sections

Key Result

Proposition 1

The assignation (\ref{eq:awta_assignation}) and optimization (\ref{eq:awta_loss}) steps of aMCL correspond to an entropy-constrained block coordinate descent on the soft distortion.
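
The alternation described in Proposition 1 can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the paper's training procedure: targets are scalar and unconditional, the loss is squared error, and the optimization step is replaced by a closed-form move to the soft barycenter rather than an update of network parameters. The names `assignation`, `optimization` and `run` are hypothetical.

```python
import math


def assignation(hypotheses, y, temperature):
    """Soft assignment of target y to the hypotheses (Boltzmann softmin)."""
    losses = [(f - y) ** 2 for f in hypotheses]
    lowest = min(losses)  # shift for numerical stability
    unnorm = [math.exp(-(loss - lowest) / temperature) for loss in losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def optimization(hypotheses, targets, weights):
    """With weights fixed, move each hypothesis to its soft barycenter.

    The weighted mean minimizes sum_i w_ik * (f_k - y_i)^2 over f_k.
    """
    updated = []
    for k, f in enumerate(hypotheses):
        mass = sum(w[k] for w in weights)
        if mass == 0.0:
            updated.append(f)  # no target assigned: leave unchanged
        else:
            updated.append(
                sum(w[k] * y for w, y in zip(weights, targets)) / mass
            )
    return updated


def run(hypotheses, targets, temperature, n_iters=100):
    """Alternate the two steps at a fixed temperature."""
    for _ in range(n_iters):
        weights = [assignation(hypotheses, y, temperature) for y in targets]
        hypotheses = optimization(hypotheses, targets, weights)
    return hypotheses


# Two clusters around -1 and +1; overall variance is about 1.007.
targets = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]
init = [-0.01, 0.01]

hot = run(init, targets, temperature=4.0)   # above 2 * variance
cold = run(init, targets, temperature=0.5)  # below 2 * variance
print("T=4.0:", hot)
print("T=0.5:", cold)
```

On this toy data both hypotheses collapse onto the global mean at the higher temperature and separate toward the two clusters at the lower one. The two temperatures sit on either side of $2\sigma^2 \approx 2.01$, the critical value quoted in the caption of Figure 3 for a Gaussian; here it is applied to a non-Gaussian toy set purely as a rough guide.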

Figures (8)

Figure 1: Overcoming limitations of Winner-Takes-All training with annealing. Illustrations of the test-time predictions on a mixture of three Gaussians (green points) with $49$ hypotheses. Shaded blue circles represent the hypothesis predictions, with intensity corresponding to the predicted scores. (Left) Predictions of MCL as proposed in [lee2016stochastic, letzelter2023resilient]. (Middle) Predictions of Relaxed WTA [rupprecht2017learning] with $\varepsilon = 0.1$. (Right) Annealed MCL with initial temperature $T_{0} = 0.6$. Each model was trained with the same backbone (a three-layer MLP). We see that WTA leaves out some hypotheses, achieving a higher quantization error than aMCL. Moreover, we see that Relaxed-WTA is biased toward the barycenter of the distribution, in contrast with aMCL.
Figure 2: Illustration of the training trajectory in the Rate-Distortion curve. Training trajectories of MCL (blue) and aMCL (red) in the case of a single Gaussian. The optimal reachable distortion ('+') is the distortion $D^\star$ satisfying $R(D^\star) = \log_{2}(n)$ (see Section \ref{sec:rate_distortion}).
Figure 3: Regimes in the distortion \ref{eq:distortion} vs. temperature training curve on the setup of Figure \ref{fig:bifurcation}. At first, the hypotheses converge to the conditional mean. It is followed by a plateau phase where performance stagnates. Transition begins at $T_{0}^c$: the hypotheses migrate toward the barycenter of each Gaussian. Then, they split and we observe a last phase transition. For reference, $2 \sigma^2$ is the critical temperature for a Gaussian with variance $\sigma^2$.
Figure 4: Conditional phase transitions with $t_{1} < t_{2} < t_{3} < t_{4}$. Results for a conditional version of the dataset in Figure \ref{fig:amcl_comparison}, where the Gaussians move linearly and their variance increases with the input $x \in [0,1]$. aMCL was trained for $1000$ epochs, using a linear scheduler with $T_{0} = 1.0$ and $49$ hypotheses. Each subplot group corresponds to the predictions of the model at a given temperature, evaluated at different $x$ values. At temperature $T(t_1)$, the hypotheses are at the barycenter. As the temperature decreases, they undergo a first phase transition (at temperature $T(t_2)$ for $x=0.9$ and $T(t_3)$ for $x=0.6$), moving toward each Gaussian's barycenter, followed by a second phase transition at $T(t_4)$. We see that earlier splits in the cooling schedule correspond to conditional distributions with higher variances.
Figure 5: Training dynamics of Relaxed-WTA with $\varepsilon$ annealed. Predictions during training of Relaxed-WTA with $\varepsilon$ annealed linearly over 1000 epochs. We use the same setup as in Figure \ref{fig:amcl_comparison} of the main paper, with 49 hypotheses as shaded blue points (shade proportional to the predicted scores), and the targets as green points. For large $\varepsilon$, the hypotheses converge to the distribution barycenter. As $\varepsilon$ decreases, three winners gradually move toward the barycenters of the Gaussians. When $\varepsilon$ approaches 0, the Winner-takes-all dynamic prevails and some hypotheses escape the barycenter to reach the Gaussian modes. However, the modes are quickly saturated, the unused hypotheses stop receiving gradient, and they remain stuck at the barycenter.
...and 3 more figures
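
As a worked instance of the condition $R(D^\star) = \log_{2}(n)$ quoted in the caption of Figure 2, the following assumes an isotropic Gaussian target in dimension $d$ with covariance $\sigma^2 I_d$ and a squared Euclidean distortion. The dimension and covariance used in the figure are not given in this summary, so these are assumptions made only for illustration. Under them, the classical Gaussian rate-distortion function gives:

$$
R(D) = \frac{d}{2}\,\log_{2}\!\frac{d\,\sigma^2}{D} \quad (0 < D \le d\,\sigma^2),
\qquad
R(D^\star) = \log_{2}(n) \;\Longleftrightarrow\; D^\star = d\,\sigma^2\, n^{-2/d}.
$$

For example, with $d = 2$ and $n = 49$ this yields $D^\star = 2\sigma^2/49 \approx 0.041\,\sigma^2$, and in the scalar case $d = 1$ it reduces to $D^\star = \sigma^2 / n^2$.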

Theorems & Definitions (16)

Proposition 1: Entropy-constrained alternated minimization
Proposition 2: aMCL training dynamic
Proposition 3: Rate-distortion properties
Proposition 4: First critical temperature
Theorem 2: Entropy constrained distortion minimization
Proposition 5: aMCL training dynamic
proof
Proposition 6: Additional properties of the distortion and the free energy
proof
Proposition 7: Rate-distortion properties
...and 6 more
