Tripod: Three Complementary Inductive Biases for Disentangled Representation Learning
Kyle Hsu, Jubayer Ibn Hamid, Kaylee Burns, Chelsea Finn, Jiajun Wu
TL;DR
This paper tackles identifiability in unsupervised disentangled representation learning by proposing Tripod, a deterministic autoencoder that combines three complementary inductive biases: finite scalar latent quantization (FSQ), kernel-based latent multiinformation (KLM) regularization, and a normalized Hessian penalty (NHP). Each bias is adapted to ease optimization. FSQ uses a fixed codebook, which removes the losses otherwise needed to learn the quantization. KLM uses kernel density estimation (KDE), which makes a density-based multiinformation term usable with a deterministic encoder. NHP provides curvature regularization that is invariant to scale. On four image disentanglement benchmarks, Tripod achieves state-of-the-art results. Ablations show that all three legs are necessary: performance drops substantially when any one leg is removed or when the three techniques are combined naively. The work shows a practical path to stronger disentanglement by re-engineering and combining existing inductive biases, at the cost of increased compute, and it opens avenues for automatic tuning of the quantization and for applications to other modalities.
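As a rough illustration of what a kernel-based multiinformation estimate can look like, the sketch below computes the average of log p(z) minus the sum of log p_j(z_j) over a batch of latent codes, with the joint and marginal densities both estimated by Gaussian kernel density estimation on the same batch. The function names, the Gaussian kernel, and the fixed bandwidth are assumptions made for this example and are not taken from the paper; Tripod's actual regularizer may differ in kernel, bandwidth selection, and normalization.

```python
import math


def _log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def kde_multiinformation(batch, bandwidth=0.5):
    """Estimate multiinformation (total correlation) of a batch of latent codes.

    batch: list of equal-length lists of floats, one list per sample.
    bandwidth: Gaussian kernel width (a hypothetical default for this sketch).
    """
    dim = len(batch[0])
    log_norm = -0.5 * math.log(2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for z in batch:
        # log_k[i][j]: log kernel value between z and sample i along dimension j.
        log_k = [
            [log_norm - 0.5 * ((z[j] - other[j]) / bandwidth) ** 2 for j in range(dim)]
            for other in batch
        ]
        # Joint density: product kernel across dimensions, averaged over samples.
        log_joint = _log_mean_exp([sum(row) for row in log_k])
        # Marginal densities: one-dimensional KDE per latent dimension.
        log_marginals = sum(
            _log_mean_exp([row[j] for row in log_k]) for j in range(dim)
        )
        total += log_joint - log_marginals
    return total / len(batch)


# Latents on a full product grid are independent under the empirical distribution.
independent = [[a, b] for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
# Latents that copy each other are fully dependent.
dependent = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]

mi_independent = kde_multiinformation(independent)
mi_dependent = kde_multiinformation(dependent)
```

With these toy batches, the estimate is approximately zero for the product grid and clearly positive for the dependent batch, which is the quantity an independence regularizer would push down.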
Abstract
Inductive biases are crucial in disentangled representation learning for narrowing down an underspecified solution set. In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature: data compression into a grid-like latent space via quantization, collective independence amongst latents, and minimal functional influence of any latent on how other latents determine data generation. In principle, these inductive biases are deeply complementary: they most directly specify properties of the latent space, encoder, and decoder, respectively. In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits. To address this, we propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives. The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks. We also verify that Tripod significantly improves upon its naive incarnation and that all three of its "legs" are necessary for best performance.
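To make the first inductive bias concrete, the following minimal sketch snaps each latent coordinate onto a fixed, evenly spaced grid of values, so the latent space becomes grid-like without any learned codebook. The function name `fsq_quantize`, the tanh bounding step, and the [-1, 1] range are assumptions chosen for illustration, and the straight-through gradient used when training through the rounding step is omitted; the quantizer in Tripod may be parameterized differently.

```python
import math


def fsq_quantize(z, num_levels):
    """Snap each latent coordinate to one of `num_levels` fixed values in [-1, 1].

    z: list of floats (one unquantized latent vector).
    num_levels: number of grid values per dimension (hypothetical hyperparameter).
    """
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    half = (num_levels - 1) / 2.0
    quantized = []
    for value in z:
        bounded = math.tanh(value)            # squash into (-1, 1)
        index = round(bounded * half + half)  # nearest grid index in [0, num_levels - 1]
        quantized.append(index / half - 1.0)  # map the index back to [-1, 1]
    return quantized


codes = fsq_quantize([0.0, 10.0, -10.0, 0.3], num_levels=5)
```

Because the grid is fixed in advance, the only learned components are the encoder and decoder; the set of representable codes is the Cartesian product of the per-dimension grids.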
