Symbolic Autoencoding for Self-Supervised Sequence Learning

Mohammad Hossein Amani; Nicolas Mario Baldwin; Amin Mansouri; Martin Josifoski; Maxime Peyrard; Robert West

TL;DR

Symbolic autoencoding (ΣAE) tackles weakly supervised sequence learning by connecting two seq2seq models through a discrete bottleneck, so that bidirectional symbol mappings can be learned from limited parallel data and abundant unparallel data. The framework is optimized end-to-end with gradient-based methods, combining supervised losses on parallel data with reconstruction losses through the discrete bottleneck, and uses surrogate gradients to handle the non-differentiable components. It presents three discrete bottleneck implementations (Softmax, Gumbel, and VQ-DB), a practical technique to mitigate hidden sequence collapse (EOS masking), and three training schedules for mixing the two data sources. Empirical results on SCAN, PCFG SET, CFQ, and COGS show substantial accuracy gains on Z sequences over supervised-only training, along with robust unsupervised reconstruction, highlighting ΣAE’s potential for weakly supervised, cross-domain sequence transduction and symbolic reasoning tasks.
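As a rough illustration of the EOS masking idea mentioned above, the sketch below computes a mask that keeps each hidden sequence up to and including its first end-of-sequence token and zeroes out everything after it. This is a minimal sketch under assumptions: the function name `eos_mask`, the token ids, and the NumPy setting are hypothetical, and the gradient approximation that the paper pairs with the masking is not shown.

```python
import numpy as np

def eos_mask(tokens, eos_id):
    """Return a (batch, T) mask: 1.0 up to and including the first EOS, 0.0 after.

    tokens: integer array of shape (batch, T) holding hidden-sequence token ids.
    eos_id: id of the end-of-sequence token (hypothetical value in the demo below).
    """
    is_eos = (tokens == eos_id).astype(np.int64)
    # Count the EOS tokens that occur strictly before each position.
    eos_before = np.cumsum(is_eos, axis=1) - is_eos
    # A position is kept only if no EOS has been emitted before it.
    return (eos_before == 0).astype(np.float32)

# Demo: the first row contains EOS (id 2) at position 2, the second row has no EOS.
tokens = np.array([[5, 7, 2, 9, 2],
                   [3, 4, 5, 6, 7]])
mask = eos_mask(tokens, eos_id=2)
```

In the demo, the first row keeps positions 0 to 2 and drops the rest, while the second row, which never emits EOS, is kept in full. Such a mask would typically be multiplied into the bottleneck outputs before they are passed to the second model.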

Abstract

Traditional language models, adept at next-token prediction in text sequences, often struggle with transduction tasks between distinct symbolic systems, particularly when parallel data is scarce. Addressing this issue, we introduce \textit{symbolic autoencoding} ($\Sigma$AE), a self-supervised framework that harnesses the power of abundant unparallel data alongside limited parallel data. $\Sigma$AE connects two generative models via a discrete bottleneck layer and is optimized end-to-end by minimizing reconstruction loss (simultaneously with supervised loss for the parallel data), such that the sequence generated by the discrete bottleneck can be read out as the transduced input sequence. We also develop gradient-based methods allowing for efficient self-supervised sequence learning despite the discreteness of the bottleneck. Our results demonstrate that $\Sigma$AE significantly enhances performance on transduction tasks, even with minimal parallel data, offering a promising solution for weakly supervised learning scenarios.

Paper Structure (21 sections, 7 equations, 6 figures, 17 tables)

Introduction
Preliminaries
Notation
Surrogate Gradients for Discrete Layers
$\Sigma$AE Framework
Discrete Bottleneck
Training Models with DB Head
Discrete Bottleneck Implementations
Remarks
Addressing Hidden Sequence Collapse in Seq2Seq Models
The Challenge: First Token Reliance
EOS Masking with Gradient Approximation
Training with Scheduling Strategies
Experimental Setup
Tasks
...and 6 more sections

Figures (6)

Figure 1: Illustration of the abstract flow of data in the symbolic autoencoding ($\Sigma$AE) framework, exemplified with the Rosetta Stone problem. Two sequence-to-sequence models ($M_{xz}$ and $M_{zx}$) are trained with both parallel data (the Rosetta Stone) through next-token prediction and unparallel data through connecting the models with a discrete bottleneck layer ($DB_x$ and $DB_z$) to autoencode each language using the other as its hidden representation.
Figure 2: Overview of symbolic autoencoding ($\Sigma$AE), illustrated on the Rosetta Stone problem (see Fig. \ref{fig:abstract-flow}). The discrete bottleneck (DB) generates two outputs: a score vector $\mathbf{s}$ and a quantized vector $\mathbf{v}_q$. When supervised labels $x^t$ are present, $M_{zx}$ is updated via the negative log-likelihood loss $\mathcal{L}_{zx} = - \log(\mathbf{s}[x^t])$ on the scores $\mathbf{s}$. Even without labels, the forward pass can continue through the unsupervised $z \to x \to z$ reconstruction path. The sequence of DB quantized vectors $\mathbf{v}_x^{<T_x}$ serves as input to $M_{xz}$ to reconstruct the original $z$ (the Greek text in the figure). Subsequently, both models, $M_{zx}$ and $M_{xz}$, are updated jointly using the reconstruction loss $\mathcal{L}_{zxz}$, calculated as the negative log-likelihood on the output scores of the decoder ($M_{xz}$).
Figure 3: Results for the Softmax discrete bottleneck: Z autoregressive sentence accuracy per supervision ratio ($\eta$). The blue line shows the accuracy of a model trained only on supervised data. At each $\eta$, one of our scheduling methods from Sec. \ref{sec:framework_scheduling} outperforms this supervised baseline. The accuracy of the models improves with more supervised data, so the performance gap narrows as the accuracies converge to their maxima. The rest of the performance metrics described in Sec. \ref{sec:metrics} are presented in Fig. \ref{fig:softmaxdball} for Softmax DB, Fig. \ref{fig:gumbeldball} for Gumbel DB, and Fig. \ref{fig:vqdball} for VQ DB.
Figure 4: Softmax DB performance metrics
Figure 5: Gumbel DB performance metrics
...and 1 more figures
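
The two bottleneck outputs described in the Figure 2 caption can be sketched for a single decoding step as follows. This is a minimal forward-pass sketch, not the authors' implementation: the names `softmax_db`, `out_proj`, and `dictionary` are hypothetical, the shapes are toy values, and it assumes the quantized vector is the dictionary embedding of the highest-scoring symbol. Surrogate gradients, which need an autodiff framework, are omitted.

```python
import numpy as np

def softmax_db(hidden, out_proj, dictionary):
    """One step of a softmax-style discrete bottleneck (illustrative only).

    hidden:     (d,)   decoder state for the current step
    out_proj:   (V, d) projection from the state to V symbol logits (assumed)
    dictionary: (V, e) embedding of each symbol for the next model (assumed)
    Returns the score vector s, the quantized vector v_q, and the chosen symbol id.
    """
    logits = out_proj @ hidden
    logits = logits - logits.max()           # subtract the max for numerical stability
    s = np.exp(logits) / np.exp(logits).sum()
    token = int(np.argmax(s))                # discrete symbol read out of the bottleneck
    v_q = dictionary[token]                  # vector passed on to the second model
    return s, v_q, token

def nll(s, target):
    """Negative log-likelihood -log(s[target]) on the bottleneck scores."""
    return -np.log(s[target])

rng = np.random.default_rng(0)
hidden = rng.normal(size=4)
out_proj = rng.normal(size=(5, 4))
dictionary = rng.normal(size=(5, 3))

s, v_q, token = softmax_db(hidden, out_proj, dictionary)
supervised_loss = nll(s, target=2)           # used when a label x^t is available
```

When a label is available, the loss is taken on `s`; when it is not, `v_q` would be fed to the second model and the loss computed on that model's scores for the original sequence, as the caption describes for the reconstruction path.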
