Bridging the Empirical-Theoretical Gap in Neural Network Formal Language Learning Using Minimum Description Length

Nur Lan, Emmanuel Chemla, Roni Katzir

TL;DR

The paper investigates why neural networks fail to generalize on formal languages despite theoretical capabilities, proposing the Minimum Description Length (MDL) objective as a bridge. By constructing a golden $a^nb^n$ LSTM and comparing it with networks trained by standard cross-entropy, it shows that MDL aligns with a true MDL-optimal solution while conventional objectives do not. Loss-surface analyses reveal that L1/L2 regularization can push models toward suboptimal optima, whereas MDL preserves the optimal solution in the local landscape, albeit with optimization challenges due to non-differentiability. The findings suggest MDL offers a principled path to provable generalization on formal-language tasks and motivate broader exploration across architectures.

Abstract

Neural networks offer good approximation to many tasks but consistently fail to reach perfect generalization, even when theoretical work shows that such perfect solutions can be expressed by certain architectures. Using the task of formal language learning, we focus on one simple formal language and show that the theoretically correct solution is in fact not an optimum of commonly used objectives -- even with regularization techniques that according to common wisdom should lead to simple weights and good generalization (L1, L2) or other meta-heuristics (early-stopping, dropout). On the other hand, replacing standard targets with the Minimum Description Length objective (MDL) results in the correct solution being an optimum.
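The MDL objective mentioned above is conventionally written as the sum of two code lengths: the bits needed to encode the hypothesis, $|H|$, plus the bits needed to encode the data given that hypothesis. The following is a minimal sketch of that sum, not the authors' implementation. It assumes the data term is the cross-entropy measured in bits, and it takes the hypothesis length as a plain number. The names `data_bits` and `mdl_score` are hypothetical.

```python
import math


def data_bits(probs_of_observed):
    """Bits needed to encode the data given the model (assumed form):
    the sum of -log2(p) over the probability the model assigned to
    each symbol that actually occurred."""
    return sum(-math.log2(p) for p in probs_of_observed)


def mdl_score(hypothesis_bits, probs_of_observed):
    """MDL score as hypothesis length plus data length, both in bits."""
    return hypothesis_bits + data_bits(probs_of_observed)


# Toy comparison with made-up numbers: a short model that is uncertain
# about every symbol versus a longer model that predicts every symbol
# with probability 1.
short_uncertain = mdl_score(100, [0.5] * 8)
long_certain = mdl_score(500, [1.0] * 8)
```

Under this score a model is rewarded only for the bits it saves on the data, so extra precision in the weights has to pay for itself in better predictions.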

Paper Structure (22 sections, 28 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Loss surfaces around the golden $a^nb^n$ LSTM from Section \ref{sec:experiment-golden-anbn}, for the regularization terms considered here: L1, L2, and $|H|$ -- the hypothesis encoding length term of the MDL objective. $|H|$ is jagged and non-differentiable but results in the correct net being an optimum of the full loss function (Figure \ref{fig:golden-losses-2d}).
  • Figure 2: Output probabilities assigned by the golden network and the best trained network for $n=73$, the first point of failure of the trained network. From left to right, each column shows the probability distribution the network outputs at one time step.
  • Figure 3: Memory values for the golden network and the best trained network for $n=73$, the first point of failure of the trained network. Each line corresponds to one component of $c$, the memory vector in the LSTM cell, as the string is fed to the network.
  • Figure 4: Training loss around the golden $a^nb^n$ LSTM, and test accuracy scores for the golden network and the local minimum network. Optimizing using L1 or L2 (\ref{fig:golden-losses-2d-l1}, \ref{fig:golden-losses-2d-l2}) would result in suboptimal networks, while MDL (\ref{fig:golden-losses-2d-mdl}) results in alignment of the golden network with an optimum point of the loss.
  • Figure 5: Training loss surfaces and the test accuracy of the best $a^nb^n$ LSTM found through a hyperparameter grid search, trained using backpropagation with the standard cross-entropy loss and early stopping based on validation loss. When evaluated using L1 or L2 (\ref{fig:best-trained-losses-3d-l1}, \ref{fig:best-trained-losses-3d-l2}) the network ranks better than the golden network but has worse test performance. When evaluated using MDL (\ref{fig:best-trained-losses-3d-mdl}) the trained network does not minimize the loss as well as the golden network, ending up in a smooth but suboptimal area of the loss space.
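
The captions above contrast the smooth L1 and L2 terms with the jagged $|H|$ term. The following sketch illustrates how that can happen. It is not the paper's encoding scheme: it assumes weights are rational numbers, and that each weight costs one sign bit plus an Elias-gamma-style code for its numerator and its denominator. All function names are hypothetical.

```python
from fractions import Fraction


def int_code_bits(n: int) -> int:
    """Length in bits of an Elias-gamma-style prefix code for a
    non-negative integer n (assumed scheme; encodes n + 1)."""
    return 2 * (n + 1).bit_length() - 1


def weight_bits(w: Fraction) -> int:
    """One sign bit plus the codes for numerator and denominator."""
    return 1 + int_code_bits(abs(w.numerator)) + int_code_bits(w.denominator)


def l1(weights):
    """L1 penalty: sum of absolute values."""
    return sum(abs(w) for w in weights)


def l2(weights):
    """L2 penalty: sum of squares."""
    return sum(w * w for w in weights)


def h_bits(weights):
    """Encoding-length penalty: total bits over all weights."""
    return sum(weight_bits(w) for w in weights)


simple = [Fraction(1)]          # the weight 1/1
nudged = [Fraction(99, 100)]    # a nearby weight, 99/100

comparison = {
    "l1": (l1(simple), l1(nudged)),
    "l2": (l2(simple), l2(nudged)),
    "bits": (h_bits(simple), h_bits(nudged)),
}
```

In this toy setup, moving the weight from 1 to 99/100 lowers both the L1 and L2 penalties slightly, while the assumed encoding length rises from 7 to 27 bits. A penalty of this kind favors weights that are short to write down rather than weights that are small in magnitude, and it changes in jumps rather than smoothly.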