Low-Dimension-to-High-Dimension Generalization And Its Implications for Length Generalization
Yang Chen, Long Yang, Yitao Liang, Zhouchen Lin
TL;DR
The paper introduces the Low-Dimension-to-High-Dimension (LDHD) generalization framework to formalize length generalization as a structured OOD problem, proving a No-Free-Lunch limitation without suitable inductive bias. It analyzes how model architectures, notably Random Feature Models with Projection (RFMP) and Position-Only Linear Attention with Advice (PLAA), induce min-degree interpolators with respect to problem-specific bases, constraining LDHD generalization to targets aligned with these biases. The work then translates these insights into practical techniques for sequence models, including Chain-of-Thought as latent-space augmentation and principled position-embedding design, introducing the RPE-Square embedding that effectively handles data-format nuisances and enhances length generalization on tasks such as unaligned copy and URF addition. Overall, the LDHD perspective provides a principled path for designing inductive biases and embeddings to enable robust length generalization in reasoning and transformer-based models.
Abstract
Low-Dimension-to-High-Dimension (LDHD) generalization is a special case of Out-of-Distribution (OOD) generalization in which the training data are restricted to a low-dimensional subspace of the high-dimensional testing space. Assuming that each instance is generated from a latent variable whose dimension reflects the problem scale, the inherent scaling challenge in length generalization can be captured by LDHD generalization in the latent space. We theoretically demonstrate that LDHD generalization is generally unattainable without exploiting prior knowledge to provide an appropriate inductive bias. Specifically, we explore LDHD generalization in Boolean functions. We verify that different architectures trained with (S)GD converge to min-degree interpolators with respect to different independent sets. LDHD generalization is achievable if and only if the target function coincides with this inductive bias. Applying the insights from LDHD generalization to length generalization, we explain the effectiveness of CoT as changing the structure of the latent space to enable better LDHD generalization. We also propose a principle for position embedding design that handles both the inherent LDHD generalization and nuisances such as the data format. Following this principle, we propose a novel position embedding called RPE-Square, which remedies the weakness of RPE in dealing with the data-format nuisance.
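The role of inductive bias in the Boolean setting can be illustrated with a small numerical sketch. It is not the paper's construction: it assumes inputs in {-1, +1}^3, a training set confined to the slice where the last coordinate equals +1, and the standard monomial basis, and it approximates a min-degree interpolator by taking the smallest maximum degree that fits the training data and then the minimum-norm least-squares solution. All function names (`monomials`, `features`, `min_degree_interpolator`, `ldhd_error`) are hypothetical.

```python
import itertools
import numpy as np

def monomials(n, max_degree):
    """All index subsets S of {0, ..., n-1} with |S| <= max_degree."""
    return [S for d in range(max_degree + 1)
            for S in itertools.combinations(range(n), d)]

def features(X, subsets):
    """Evaluate chi_S(x) = prod_{i in S} x_i for every row x and subset S."""
    return np.stack([np.prod(X[:, list(S)], axis=1) for S in subsets], axis=1)

def min_degree_interpolator(X, y, tol=1e-9):
    """Assumed stand-in for a min-degree interpolator: the lowest maximum
    degree that fits the data, with the minimum-norm coefficients."""
    n = X.shape[1]
    for d in range(n + 1):
        subsets = monomials(n, d)
        Phi = features(X, subsets)
        # lstsq returns the minimum-norm solution when Phi is rank deficient.
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        if np.max(np.abs(Phi @ w - y)) < tol:
            return lambda Z, s=subsets, w=w: features(Z, s) @ w
    raise ValueError("training labels are inconsistent")

def ldhd_error(target, n=3, k=2):
    """Train on the slice x_k = ... = x_{n-1} = +1, test on the full cube."""
    cube = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
    train = cube[np.all(cube[:, k:] == 1.0, axis=1)]
    f_hat = min_degree_interpolator(train, target(train))
    return float(np.mean((f_hat(cube) - target(cube)) ** 2))

# Target that depends only on coordinates seen to vary during training.
err_aligned = ldhd_error(lambda X: X[:, 0] * X[:, 1])
# Target that depends on the coordinate frozen to +1 during training.
err_misaligned = ldhd_error(lambda X: X[:, 0] * X[:, 2])
print(err_aligned, err_misaligned)
```

Under these assumptions the first error should be close to 0 and the second close to 2: on the training slice the target x_0 x_2 is indistinguishable from x_0, so the low-degree bias selects x_0 and fails wherever x_2 = -1. This mirrors, in miniature, the claim that LDHD generalization succeeds only when the target agrees with the interpolator the learner is biased toward.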
