Symbolic Autoencoding for Self-Supervised Sequence Learning
Mohammad Hossein Amani, Nicolas Mario Baldwin, Amin Mansouri, Martin Josifoski, Maxime Peyrard, Robert West
TL;DR
Symbolic autoencoding (ΣAE) addresses weakly supervised sequence learning by connecting two seq2seq models through a discrete bottleneck, so that bidirectional symbol mappings can be learned from limited parallel data and abundant unparallel data. The framework is optimized end to end with gradient-based methods, combining supervised losses on the parallel data with reconstruction losses through the discrete bottleneck, and it uses surrogate gradients to handle the non-differentiable components. It presents multiple discrete bottleneck implementations (Softmax, Gumbel, and VQ-DB), practical techniques for mitigating hidden sequence collapse (EOS masking), and three training schedules for leveraging mixed data sources. Empirical results on SCAN, PCFG SET, CFQ, and COGS demonstrate substantial gains in Z-space and robust unsupervised reconstruction, highlighting ΣAE’s potential for weakly supervised, cross-domain sequence transduction and symbolic reasoning tasks.
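To make the idea of a surrogate gradient through a discrete bottleneck concrete, the following is a minimal sketch for a single sequence position. It assumes a straight-through estimator over a softmax: the forward pass emits a hard one-hot token, while the backward pass uses the gradient of the softmax probabilities. The function names (`bottleneck_forward`, `straight_through_backward`) are hypothetical, and the paper's Softmax, Gumbel, and VQ-DB bottlenecks may differ in their details.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bottleneck_forward(logits):
    """Discretize one position: return (token_id, one_hot, probs).

    The argmax is the non-differentiable step; the one-hot vector is
    what the downstream model would consume.
    """
    probs = softmax(logits)
    token_id = max(range(len(probs)), key=probs.__getitem__)
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(probs))]
    return token_id, one_hot, probs


def straight_through_backward(grad_output, probs):
    """Surrogate gradient w.r.t. the logits (an assumed estimator).

    Pretends the one-hot output was the softmax output and applies the
    softmax vector-Jacobian product:
        grad_logits[j] = probs[j] * (grad_output[j] - sum_i grad_output[i] * probs[i])
    """
    dot = sum(g * p for g, p in zip(grad_output, probs))
    return [p * (g - dot) for g, p in zip(grad_output, probs)]


# Usage: logits for a 3-symbol vocabulary at one position.
token_id, one_hot, probs = bottleneck_forward([2.0, 0.5, -1.0])
# Suppose the downstream loss gradient w.r.t. the one-hot vector is:
grad_logits = straight_through_backward([1.0, 0.0, 0.0], probs)
```

In this sketch the forward output stays strictly discrete (here the first symbol is selected), yet the logits still receive a gradient signal, which is what allows a reconstruction loss to train the model that feeds the bottleneck.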
Abstract
Traditional language models, adept at next-token prediction in text sequences, often struggle with transduction tasks between distinct symbolic systems, particularly when parallel data is scarce. Addressing this issue, we introduce symbolic autoencoding (ΣAE), a self-supervised framework that harnesses the power of abundant unparallel data alongside limited parallel data. ΣAE connects two generative models via a discrete bottleneck layer and is optimized end-to-end by minimizing reconstruction loss (simultaneously with supervised loss for the parallel data), such that the sequence generated by the discrete bottleneck can be read out as the transduced input sequence. We also develop gradient-based methods allowing for efficient self-supervised sequence learning despite the discreteness of the bottleneck. Our results demonstrate that ΣAE significantly enhances performance on transduction tasks, even with minimal parallel data, offering a promising solution for weakly supervised learning scenarios.
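One way to write down the joint objective described above is sketched here for a single direction (X to Z and back to X). The notation is an assumption made for illustration and is not taken from the paper: $M_{xz}$ and $M_{zx}$ denote the two generative models, $\mathrm{DB}$ the discrete bottleneck, $D_{xz}$ the parallel data, and $D_x$ the unparallel data.

```latex
\[
\mathcal{L}
  = \underbrace{\mathbb{E}_{(x,z)\sim D_{xz}}
      \bigl[-\log p_{M_{xz}}(z \mid x)\bigr]}_{\text{supervised loss}}
  + \underbrace{\mathbb{E}_{x\sim D_{x}}
      \bigl[-\log p_{M_{zx}}\bigl(x \mid \hat{z}(x)\bigr)\bigr]}_{\text{reconstruction loss}},
\qquad
\hat{z}(x) = \mathrm{DB}\bigl(M_{xz}(x)\bigr)
\]
```

Under this reading, $\hat{z}(x)$ is the discrete sequence that can be read out as the transduction of $x$. The opposite direction (Z to X and back to Z) would follow the same pattern with the roles of the two models exchanged.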
