Improving Discrete Optimisation Via Decoupled Straight-Through Estimator

Rushi Shah; Mingyuan Yan; Michael Curtis Mozer; Dianbo Liu

TL;DR

We propose Decoupled Straight-Through (Decoupled ST), a minimal modification of the Straight-Through Estimator that introduces separate temperatures for the forward pass ($\tau_f$) and the backward pass ($\tau_b$), enabling independent tuning of exploration and gradient dispersion.

Abstract

The Straight-Through Estimator (STE) is the dominant method for training neural networks with discrete variables, enabling gradient-based optimisation by routing gradients through a differentiable surrogate. However, existing STE variants conflate two fundamentally distinct concerns: forward-pass stochasticity, which controls exploration and latent space utilisation, and backward-pass gradient dispersion, that is, how learning signals are distributed across categories. We show that these concerns are qualitatively different and that tying them to a single temperature parameter leaves significant performance gains untapped. We propose Decoupled Straight-Through (Decoupled ST), a minimal modification that introduces separate temperatures for the forward pass ($\tau_f$) and the backward pass ($\tau_b$). This simple change enables independent tuning of exploration and gradient dispersion. Across three diverse tasks (Stochastic Binary Networks, Categorical Autoencoders, and Differentiable Logic Gate Networks), Decoupled ST consistently outperforms Identity STE, Softmax STE, and Straight-Through Gumbel-Softmax. Crucially, the optimal $(\tau_f, \tau_b)$ configurations lie far off the diagonal $\tau_f = \tau_b$, confirming that the two concerns require different answers and that single-temperature methods are fundamentally constrained.
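To make the two-temperature idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of the abstract: the forward pass samples a one-hot category from $\mathrm{softmax}(\text{logits}/\tau_f)$, while the backward pass propagates gradients as if the output had been $\mathrm{softmax}(\text{logits}/\tau_b)$. The function names `softmax`, `decoupled_st_forward`, and `decoupled_st_backward` are hypothetical, and the backward pass is written out by hand so that the example needs only the standard library.

```python
import math
import random


def softmax(logits, tau):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def decoupled_st_forward(logits, tau_f, rng):
    """Forward pass (assumed form): sample a one-hot vector from softmax(logits / tau_f).

    A larger tau_f flattens the sampling distribution, i.e. more exploration.
    """
    probs = softmax(logits, tau_f)
    index = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return [1.0 if i == index else 0.0 for i in range(len(logits))]


def decoupled_st_backward(logits, upstream_grad, tau_b):
    """Backward pass (assumed form): differentiate softmax(logits / tau_b) instead of the sample.

    Returns the vector-Jacobian product g^T J, where
    J[i][j] = d softmax_i / d logit_j = p_i * (delta_ij - p_j) / tau_b.
    A larger tau_b spreads the learning signal over more categories.
    """
    p = softmax(logits, tau_b)
    dot = sum(g * pi for g, pi in zip(upstream_grad, p))
    return [pj * (gj - dot) / tau_b for pj, gj in zip(p, upstream_grad)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 0.5, -1.0]
    upstream = [1.0, -2.0, 0.5]  # illustrative gradient of the loss w.r.t. the one-hot output

    sample = decoupled_st_forward(logits, tau_f=0.5, rng=rng)
    grad = decoupled_st_backward(logits, upstream, tau_b=2.0)
    print("one-hot sample:", sample)
    print("gradient w.r.t. logits:", grad)
```

Under this reading, setting `tau_f == tau_b` collapses the sketch to a single-temperature softmax straight-through estimator, which corresponds to the diagonal $\tau_f = \tau_b$ that the abstract reports as suboptimal.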
